The 3GD system has a default configuration that allows simple usage, which in many cases might be sufficient. It is nevertheless advisable to edit ntservice.ini to match the configuration in which the system is to run.
- Amount of host buffers
Performance is very dependent on the number of host buffers defined. If the configured number of host buffers is higher than the number actually required, performance will suffer: even though no traffic is assigned to the unused host buffers, they still need to be serviced, which consumes CPU and PCIe bandwidth.
To get an idea of the number of host buffers required, one can use the Profiling tool located in the tools package. The tool shows how many host buffers are used and by which applications/streams.
- NUMA node
PCIe performance can be limited depending on the server layout. If the server has multiple CPUs and multiple IOHs (I/O hubs), one needs to pay attention to the NUMA nodes and assign an adapter's host buffers to the NUMA node with the shortest QPI (QuickPath Interconnect) distance to the adapter.
Below is an extract from ntlog on a system with 2 NUMA nodes but only one IOH.
    # /opt/napatech3/bin/ntlog | cut -d- -f2
    ....
    PCIe settings:
    Supported max payload of 256 bytes. Negotiated 256 bytes
    PCIgen in use 2. Adapter supports PCIgen 2.
    Adapter support 8 lanes @ 5.0 GT/s. Link negotiated to 8 lanes @ 5.0 GT/s.
    PCIe measurements on NUMA node 0
    RX only max expected throughput is: 29041 Mbps
    TX only max expected throughput is: 28259 Mbps
    RX combined max expected throughput is: 27406 Mbps
    TX combined max expected throughput is: 27821 Mbps
    PCIe measurements on NUMA node 1
    RX only max expected throughput is: 29057 Mbps
    TX only max expected throughput is: 28259 Mbps
    RX combined max expected throughput is: 27399 Mbps
    TX combined max expected throughput is: 27804 Mbps
    ....
The extract below is the same dump but on a system with 2 IOHs and 2 NUMA nodes.
    # /opt/napatech3/bin/ntlog | cut -d- -f2
    ....
    PCIe settings:
    Supported max payload of 256 bytes. Negotiated 256 bytes
    PCIgen in use 2. Adapter supports PCIgen 2.
    Adapter support 8 lanes @ 5.0 GT/s. Link negotiated to 8 lanes @ 5.0 GT/s.
    PCIe measurements on NUMA node 0
    RX only max expected throughput is: 29041 Mbps
    TX only max expected throughput is: 28259 Mbps
    RX combined max expected throughput is: 27406 Mbps
    TX combined max expected throughput is: 27821 Mbps
    PCIe measurements on NUMA node 1
    RX only max expected throughput is: 19057 Mbps
    TX only max expected throughput is: 18259 Mbps
    RX combined max expected throughput is: 17399 Mbps
    TX combined max expected throughput is: 17804 Mbps
    ....
In this second dump the expected throughput on NUMA node 1 is roughly 10 Gbps lower than on NUMA node 0, which suggests that the adapter sits closer to NUMA node 0 and that its host buffers should be placed there.
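The comparison of the per-node measurements can be sketched in C as below. This is only an illustration: the helper name best_numa_node and the hard-coded array are assumptions made for this example and are not part of the Napatech API. The values are the "RX combined" figures from the dual-IOH extract.

```c
#include <stddef.h>

/* Hypothetical helper (not part of NTAPI): return the index of the NUMA node
 * with the highest expected throughput in Mbps, or -1 if the list is empty. */
static int best_numa_node(const unsigned *mbps_per_node, size_t node_count)
{
    if (mbps_per_node == NULL || node_count == 0)
        return -1;

    size_t best = 0;
    for (size_t i = 1; i < node_count; i++) {
        if (mbps_per_node[i] > mbps_per_node[best])
            best = i;
    }
    return (int)best;
}

/* "RX combined" values copied from the dual-IOH ntlog extract. */
static const unsigned rx_combined_mbps[] = { 27406, 17399 };
```

Calling best_numa_node(rx_combined_mbps, 2) selects node 0 for these figures. On the single-IOH system the two nodes differ by only a few Mbps, so the choice matters far less there.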
- Intel Turbo Boost
Availability and frequency upside of Turbo Boost state depends upon several factors including, but not limited to, the following:
- Type of workload
- Number of active cores
- Estimated current consumption
- Estimated power consumption
- Processor temperature
The operation is dependent on headroom available in one or more cores. The amount of time the system spends in Turbo Boost mode depends on workload, environment, platform design, and overall system configuration.
Maximum frequency for processor function can't be specified. Maximum frequency is automatic and dependent on working conditions.
Due to varying power characteristics, some parts with Turbo Boost may not achieve maximum turbo frequencies when running heavy workloads and using multiple cores concurrently.
The less total power the CPU is using, the higher it can set the clocks in turbo. The more cores that are idle, the better the Turbo Boost performance on a single core.
- Type of workload
- Network macros
The network macros are by default able to run on all supported packet descriptors. This might not be optimal, because each macro needs to determine the descriptor format and step through an if-then-else chain until the correct version is located. Below is an example of a network macro implementation to illustrate the if-then-else case:
    #define NT_NET_GET_PKT_HASH(_h_) _NT_NET_GET_PKT_HASH(_h_)
    ...
    /* Descriptor type fixed at compile time: no run-time test */
    #if defined(_NTAPI_EXTDESCR_7_)
    #define _NT_NET_GET_PKT_HASH(_h_) _NT_NET_GET_PKT_HASH_EXT7(_h_)
    #elif defined(_NTAPI_EXTDESCR_8_)
    #define _NT_NET_GET_PKT_HASH(_h_) _NT_NET_GET_PKT_HASH_EXT8(_h_)
    #elif defined(_NTAPI_EXTDESCR_9_)
    #define _NT_NET_GET_PKT_HASH(_h_) _NT_NET_GET_PKT_HASH_EXT9(_h_)
    #else
    /* Default: the descriptor format is tested on every call */
    #define _NT_NET_GET_PKT_HASH(_h_) \
      ((NT_NET_GET_PKT_DESCRIPTOR_FORMAT(_h_)==7)?_NT_NET_GET_PKT_HASH_EXT7(_h_): \
       NT_NET_GET_PKT_DESCRIPTOR_FORMAT(_h_)==8?_NT_NET_GET_PKT_HASH_EXT8(_h_): \
       NT_NET_GET_PKT_DESCRIPTOR_FORMAT(_h_)==9?_NT_NET_GET_PKT_HASH_EXT9(_h_): \
       assert(0))
    #endif
Two options exist to prevent the macros performing if-then-else:
- Define _NTAPI_EXTDESCR_?_ (for example _NTAPI_EXTDESCR_7_) to match the extended descriptor configured in ntservice.ini. Macros not available for the chosen descriptor type will return -1.
- Use NT_NET_GET_PKT_DESCRIPTOR_FORMAT() to detect the format and use the format descriptors directly
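The two options can be illustrated with a self-contained C sketch. It does not use the NTAPI headers: struct demo_pkt, the DEMO_ macros and sum_hashes() are hypothetical stand-ins invented for this example, and the second option assumes that all packets in a batch share the same descriptor format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packet header; the real NTAPI descriptors are not modelled. */
struct demo_pkt {
    int      descriptor_format;   /* 7, 8 or 9 in this sketch */
    uint32_t hash_ext7;
    uint32_t hash_ext8;
    uint32_t hash_ext9;
};

#define DEMO_GET_FORMAT(p)    ((p)->descriptor_format)
#define DEMO_GET_HASH_EXT7(p) ((p)->hash_ext7)
#define DEMO_GET_HASH_EXT8(p) ((p)->hash_ext8)
#define DEMO_GET_HASH_EXT9(p) ((p)->hash_ext9)

#if defined(DEMO_EXTDESCR_7)
/* Option 1: format fixed at build time, no run-time test at all. */
#define DEMO_GET_HASH(p) DEMO_GET_HASH_EXT7(p)
#elif defined(DEMO_EXTDESCR_8)
#define DEMO_GET_HASH(p) DEMO_GET_HASH_EXT8(p)
#elif defined(DEMO_EXTDESCR_9)
#define DEMO_GET_HASH(p) DEMO_GET_HASH_EXT9(p)
#else
/* Default: every call walks the if-then-else chain. */
#define DEMO_GET_HASH(p) \
    (DEMO_GET_FORMAT(p) == 7 ? DEMO_GET_HASH_EXT7(p) : \
     DEMO_GET_FORMAT(p) == 8 ? DEMO_GET_HASH_EXT8(p) : \
     DEMO_GET_FORMAT(p) == 9 ? DEMO_GET_HASH_EXT9(p) : (assert(0), 0u))
#endif

/* Option 2: detect the format once, then use the format-specific macro
 * inside the loop. Assumes every packet in the batch has the same format. */
static uint64_t sum_hashes(const struct demo_pkt *pkts, int count)
{
    uint64_t sum = 0;
    int i;

    if (count <= 0)
        return 0;

    switch (DEMO_GET_FORMAT(&pkts[0])) {
    case 7:
        for (i = 0; i < count; i++) sum += DEMO_GET_HASH_EXT7(&pkts[i]);
        break;
    case 8:
        for (i = 0; i < count; i++) sum += DEMO_GET_HASH_EXT8(&pkts[i]);
        break;
    case 9:
        for (i = 0; i < count; i++) sum += DEMO_GET_HASH_EXT9(&pkts[i]);
        break;
    default:
        /* Unknown format: fall back to the generic macro. */
        for (i = 0; i < count; i++) sum += DEMO_GET_HASH(&pkts[i]);
        break;
    }
    return sum;
}
```

Building with, for example, -DDEMO_EXTDESCR_8 collapses DEMO_GET_HASH() to a single field access, which mirrors the first option. The switch in sum_hashes() mirrors the second option by moving the format test out of the per-packet path.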
- Amount of host buffers
- Memory limit on 32-bit systems
32-bit systems have a limited host buffer memory pool because of the 1G/3G kernel/user space memory split. If the system contains multiple adapters, ntservice.ini must be tweaked to limit the number of host buffers required. One obvious candidate is to disable the TX host buffers if TX is not used.
    [Adapter0]
    AdapterType=NT4E
    BusId=00:0a:00:00
    HostBuffersTx=[0,16,0] # 0 TX host buffers, disable the TX host buffers for adapter 0
or disable RX host buffers if only TX is used:
    [Adapter0]
    AdapterType=NT4E
    BusId=00:0a:00:00
    HostBuffersRx=[0,16,0] # 0 RX host buffers, disable the RX host buffers for adapter 0
Getting host buffer memory might not always be possible because memory tends to become fragmented over time. To avoid this, the Napatech Software Suite low-level driver is in charge of allocating memory and has been designed in such a way that it always tries to allocate more memory if needed but never releases memory until it is unloaded. Hence it is a good idea to load the low-level driver only once and never reload it, to avoid memory fragmentation.
A total of 1024GB (1TB) host buffers across all configured adapters is supported by the Napatech driver.
- Linux kernel sysctl "vm.max_map_count":
Especially for large host buffer configurations it is necessary to adjust the kernel sysctl "vm.max_map_count" (/proc/sys/vm/max_map_count).
It should be set to at least the total configured host buffer memory in MB multiplied by four.
Example for total host buffer size 128GB (131072MB): 131072*4 = 524288. Hence the minimum value for "vm.max_map_count" is 524288.
If your kernel is already using a higher value, leave it as it is.
If your application has special memory mapping needs of its own, add them on top of this value.
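The sizing rule above can be written as a small C sketch. The function names and the application_extra_maps parameter are assumptions made for this illustration; only the "MB multiplied by four" rule and the 128GB example come from the text.

```c
#include <stdint.h>

/* Rule from the text: at least the total host buffer memory in MB times four,
 * plus whatever the application needs for its own mappings (assumed input). */
static uint64_t min_max_map_count(uint64_t total_host_buffer_mb,
                                  uint64_t application_extra_maps)
{
    return total_host_buffer_mb * 4u + application_extra_maps;
}

/* Never lower a value that is already high enough. */
static uint64_t recommended_max_map_count(uint64_t current_value,
                                          uint64_t total_host_buffer_mb,
                                          uint64_t application_extra_maps)
{
    uint64_t minimum = min_max_map_count(total_host_buffer_mb,
                                         application_extra_maps);
    return current_value > minimum ? current_value : minimum;
}
```

For the 128GB example, min_max_map_count(131072, 0) gives 524288. The resulting value can then be applied with the standard sysctl tool, for instance sysctl -w vm.max_map_count=524288.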
- Linux kernel sysctl "kernel.numa_balancing":
Especially for large host buffer configurations it may be necessary to disable the kernel sysctl "kernel.numa_balancing" (/proc/sys/kernel/numa_balancing).
In some large host buffer scenarios the automatic NUMA page balancing and page migration do not work properly and impose a heavy load on the system.
Symptoms seen: idle user-space threads consume close to 100% system time. Calls to sleep(), usleep() and nanosleep() become inaccurate, yet their return value is zero (uninterrupted).
Disabling NUMA balancing fixes this (kernel sysctl: kernel.numa_balancing=0).
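An application could check the current setting at start-up and warn if NUMA balancing is still enabled. The following C sketch shows one way to do that. The helper names are assumptions made for this example, and the path is passed in as a parameter so the code can be exercised against any file.

```c
#include <stdio.h>

/* Read a single integer from a sysctl file under /proc/sys.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed. */
static int read_sysctl_long(const char *path, long *value)
{
    FILE *f = fopen(path, "r");
    int parsed;

    if (f == NULL)
        return -1;
    parsed = fscanf(f, "%ld", value);
    fclose(f);
    return parsed == 1 ? 0 : -1;
}

/* Returns 1 if automatic NUMA balancing is enabled, 0 if it is disabled,
 * and -1 if the state cannot be determined (for example on a kernel
 * that does not provide the sysctl). */
static int numa_balancing_enabled(const char *path)
{
    long v;

    if (read_sysctl_long(path, &v) != 0)
        return -1;
    return v != 0 ? 1 : 0;
}
```

A typical call would be numa_balancing_enabled("/proc/sys/kernel/numa_balancing"). A return value of 1 indicates that the setting described above has not yet been applied.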
If multiple streams/applications share host buffers, they will influence each other and one application can cause the other to drop packets. To avoid one application influencing another application, the Host Buffer Allowance should be used. If sharing is done across adapters, some sort of time synchronization is needed or adapters might drift away from each other causing buffering and dropped packets. Time synchronization is configured in ntservice.ini.
On in-line adapters, be aware that as soon as an Assign command has been executed successfully, the specified data starts flowing to the destination port.
Debugging is done via ntlog, monitoring and profiling. Together these tools provide information about the warnings/errors that have occurred and the current adapter status.
If the user application crashes, NtService automatically catches the streams in use and releases them. What NtService does not do is destroy StreamIds. Applications must destroy stream IDs themselves if needed. Applications can be written such that StreamIds are created external to the application and the application only opens StreamIds.
One stream ID can cover multiple host buffers and these need to be mapped into the application opening the stream ID. The mapping is done within NT_NetRxGet() but is not atomic, so host buffers are mapped one after another. This results in NT_NetRxGet() and NT_NetRxGetNextPacket() returning data from the host buffers as they are mapped, hence merging happens as host buffers are mapped. It may therefore initially seem that NT_NetRxGet() and NT_NetRxGetNextPacket() receive traffic from only one host buffer, but over time traffic from all host buffers will be returned. The same behavior occurs when host buffers are assigned via NTPL after NT_NetRxOpen() has been called.
When transmitting to one port from multiple applications, TXNOW must always be set by all of them. See "Transmitting to one port from multiple applications using absolute TX timing" for further information.
To achieve latency stability, the use of CPU affinity is needed. On Linux a combination of the kernel boot parameter isolcpus, sched_setaffinity() and sched_setscheduler() is needed to achieve optimal latency stability. As an example, the following setup could be used:
- Set kernel boot parameter: isolcpus=2,3,6,7
- Start 3GD limiting it to use cores 2 and 3: taskset -c 2,3 /opt/napatech3/bin/ntstart.sh
- Run the application on cores 6 and 7 with SCHED_RR
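The application side of this setup can be sketched in C with the two Linux calls named above. The helper names build_cpu_set and pin_and_set_rr are assumptions made for this example, as is the choice of real-time priority. SCHED_RR normally requires root privileges or CAP_SYS_NICE, so the second helper can fail on an unprivileged account.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Build a CPU set from a list of core numbers. */
static void build_cpu_set(const int *cores, size_t count, cpu_set_t *set)
{
    assert(cores != NULL && set != NULL);
    CPU_ZERO(set);
    for (size_t i = 0; i < count; i++)
        CPU_SET(cores[i], set);
}

/* Pin the calling thread to the given cores and switch it to SCHED_RR.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 * rt_priority must lie within the SCHED_RR range reported by
 * sched_get_priority_min() and sched_get_priority_max(). */
static int pin_and_set_rr(const int *cores, size_t count, int rt_priority)
{
    cpu_set_t set;
    struct sched_param param = { .sched_priority = rt_priority };

    build_cpu_set(cores, count, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return sched_setscheduler(0, SCHED_RR, &param) != 0 ? -1 : 0;
}
```

With the isolcpus setting from the example, the application would call pin_and_set_rr((int[]){ 6, 7 }, 2, 10) early in its packet-processing thread. The priority value 10 is arbitrary here and should be chosen to suit the system.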