for Dell Poweredge R750 server with 1TB ram, 76 CPUs, redhat 9
check if there's packet loss
check loadavg, memory and cpu usage and disk IO (use netdata historical charts) - high metrics will generate packet loss as kernel is unable to keep up with TCP stream
check health of interface, errors/dropped/missed
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether c8:4b:d6:8c:93:36 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
843514769010 6638124689 0 0 0 20656155
TX: bytes packets errors dropped carrier collsns
6537622911210 12047395179 0 0 0 0
altname enp4s0f0
altname eno8303
check errors and stats (tx_errors, rx_errors)
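A minimal sketch for this check, assuming the interface is em1 as in the output above and that the driver exposes its counters through ethtool -S (counter names differ between drivers). The helper name nonzero_error_counters is made up for illustration; it only filters whatever the driver reports.

```shell
# Keep only error/drop/miss counters whose value is non-zero.
# Expects "name: value" lines on stdin, as printed by `ethtool -S <iface>`.
nonzero_error_counters() {
    awk -F': *' '
        tolower($1) ~ /err|drop|miss|discard/ && $2 + 0 > 0 {
            name = $1
            sub(/^[ \t]+/, "", name)   # strip the leading indentation
            print name ": " $2
        }'
}

# Usage (needs a real interface, so left as comments):
#   ip -s link show em1                       # kernel view: errors/dropped/missed
#   ethtool -S em1 | nonzero_error_counters   # driver/NIC view
```

An empty result means none of the matching counters has moved since the driver was loaded.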
check high retransmit rates
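One way to put a number on this, as a sketch: /proc/net/snmp carries the OutSegs and RetransSegs counters for TCP, so their ratio gives a retransmit percentage. The function name tcp_retrans_pct is hypothetical, and the counters are totals since boot, so comparing two samples taken some time apart says more than a single reading.

```shell
# Print TCP retransmitted segments as a percentage of segments sent.
# Reads a file in /proc/net/snmp format (default: /proc/net/snmp).
tcp_retrans_pct() {
    awk '
        $1 == "Tcp:" && !seen {        # first Tcp: line holds the column names
            for (i = 2; i <= NF; i++) col[$i] = i
            seen = 1
            next
        }
        $1 == "Tcp:" {                 # second Tcp: line holds the values
            out = $(col["OutSegs"]); re = $(col["RetransSegs"])
            if (out > 0) { printf "%.2f\n", 100 * re / out } else { print "0.00" }
            exit
        }' "${1:-/proc/net/snmp}"
}

if [ -r /proc/net/snmp ]; then
    tcp_retrans_pct
fi
```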
tcp_max_reordering: Controls how tolerant the stack is to packet reordering. Default is 300; you can increase if you expect reordering (per-packet load balancing, etc.):
sysctl -w net.ipv4.tcp_max_reordering=600
A higher value makes Linux more tolerant of reordering before considering a packet “lost,” which can reduce DUP ACKs when packets simply arrive out of order.
tcp_moderate_rcvbuf: Enables autotuning of TCP receive buffers, generally should be enabled (value 1):
sysctl -w net.ipv4.tcp_moderate_rcvbuf=1
This allows the system to adjust buffer sizes to match network conditions, potentially reducing out-of-order delivery caused by buffer exhaustion.
tcp_delack_min: Controls the delayed ACK timer, but in recent Red Hat kernels defaults are aggressive (4 ms). Only adjust if you have a specific reason, as improper settings can harm performance:
sysctl -w net.ipv4.tcp_delack_min=4
(Confirm support with ls /proc/sys/net/ipv4/ | grep tcp_delack_min)
Enable Selective Acknowledgments (SACK): SACK improves recovery from packet loss.
sysctl -w net.ipv4.tcp_sack=1
Selective Acknowledgments (SACK): SACK is a TCP feature that allows the receiver to acknowledge specific segments of data that were received correctly, even if some segments are missing (e.g., due to packet loss). Without SACK, TCP uses cumulative acknowledgments, which may force retransmission of more data than necessary.
Improved Recovery from Packet Loss: SACK reduces unnecessary retransmissions by informing the sender exactly which segments were received, allowing the sender to retransmit only the lost segments.
Better Performance in Lossy Networks: SACK is particularly beneficial in high-latency or congested networks, where packet loss is more common.
Reduced TCP DUP ACKs: By improving packet loss recovery, SACK can decrease the frequency of Duplicate ACKs, as the receiver can acknowledge out-of-order segments more precisely, reducing the need for repeated ACKs.
Adjust TCP Window Scaling: Increase the TCP window size to handle high-latency or high-bandwidth networks.
sysctl -w net.ipv4.tcp_window_scaling=1
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
sysctl -w net.ipv4.tcp_wmem='4096 65536 16777216'
Enable Fast Recovery Algorithms: Use modern congestion control algorithms like CUBIC or BBR.
sysctl -w net.ipv4.tcp_congestion_control=cubic
Increase Backlog Queue: Prevent drops during bursts of traffic.
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
Reduce Keepalive Overhead: Adjust TCP keepalive settings to minimize unnecessary traffic.
sysctl -w net.ipv4.tcp_keepalive_time=7200
sysctl -w net.ipv4.tcp_keepalive_intvl=75
sysctl -w net.ipv4.tcp_keepalive_probes=9
Update NIC firmware and ensure you are using the latest drivers
Disable offload features (for testing): Sometimes, features like generic receive offload (GRO), large receive offload (LRO), or TCP segmentation offload (TSO) can cause reordering. Test by toggling with ethtool.
If this reduces DUP ACKs, you may need deeper troubleshooting with NIC vendor support
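A sketch of the toggle test, assuming interface em1. The wrapper set_offloads and its DRY_RUN switch are hypothetical conveniences; the underlying calls are plain ethtool -k (show) and ethtool -K (change). Some NICs report a feature as fixed, in which case it cannot be changed.

```shell
# Turn GRO, LRO and TSO on or off together for one interface.
# With DRY_RUN=1 the ethtool command is printed instead of executed.
set_offloads() {
    cmd="ethtool -K $1 gro $2 lro $2 tso $2"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

# Usage (root required; the change is lost on reboot or driver reload):
#   ethtool -k em1 | grep -E 'generic-receive|large-receive|tcp-segmentation'
#   set_offloads em1 off    # test, then watch DUP ACK / retransmit counters
#   set_offloads em1 on     # restore
```

Expect higher CPU usage while the offloads are disabled, so compare the counters over a short, representative window.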
Tune Interrupt Coalescing: Adjust interrupt coalescing to balance latency and throughput.
ethtool -C em1 rx-usecs 50
Interrupt Coalescing: Network interface cards (NICs) generate interrupts to notify the CPU of incoming or outgoing packets. Interrupt coalescing reduces the number of interrupts by grouping multiple packets and delaying interrupts for a short period, improving CPU efficiency and throughput.
rx-usecs 50: This sets the time the NIC waits (50 microseconds) before generating an interrupt for received packets. During this period, the NIC buffers incoming packets and processes them in a single interrupt, reducing CPU overhead.
RX PARAMETERS
rx-usecs - Number of microseconds to delay an RX interrupt after packet arrival. If 0, only rx-max-frames is used. 
Do not set both rx-usecs and rx-max-frames to zero as this would cause RX interrupts to never be generated.
rx-usecs-low - Same as rx-usecs, but used in concert with pkt-rate-low (see below).
rx-usecs-high - Same as rx-usecs, but used in concert with pkt-rate-high (see below).
rx-max-frames - Number of packets to delay an RX interrupt after packet arrival. If 0, only rx-usecs is used. 
Do not set both rx-usecs and rx-max-frames to zero as this would cause RX interrupts to never be generated.
rx-max-frames-low - Same as rx-max-frames, but used in concert with pkt-rate-low (see below).
rx-max-frames-high - Same as rx-max-frames, but used in concert with pkt-rate-high (see below).
rx-usecs-irq - Number of microseconds to delay an RX interrupt after packet arrival while the host is also servicing an IRQ. Some NIC drivers may not support this feature.
rx-max-frames-irq - Number of packets to delay an RX interrupt after packet arrival while the host is also servicing an IRQ. Some NIC drivers may not support this feature.
TX PARAMETERS
tx-usecs - Number of microseconds to delay a TX interrupt after sending a packet. If 0, only tx-max-frames is used. 
Do not set both tx-usecs and tx-max-frames to zero as this would cause TX interrupts to never be generated.
tx-usecs-low - Similar to tx-usecs, but used in concert with pkt-rate-low (see below).
tx-usecs-high - Similar to tx-usecs, but used in concert with pkt-rate-high (see below).
tx-max-frames - Number of packets to delay a TX interrupt after sending a packet. If 0, only tx-usecs is used. 
Do not set both tx-usecs and tx-max-frames to zero as this would cause TX interrupts to never be generated.
tx-max-frames-low - Similar to tx-max-frames, but used in concert with pkt-rate-low (see below).
tx-max-frames-high - Similar to tx-max-frames, but used in concert with pkt-rate-high (see below).
tx-usecs-irq - Number of microseconds to delay a TX interrupt after sending a packet while the host is also servicing an IRQ. Some NICs may not support this feature.
tx-max-frames-irq - Number of packets to delay a TX interrupt after sending a packet while the host is also servicing an IRQ. Some NICs may not support this feature.
OTHER PARAMETERS
adaptive-rx - An algorithm to improve rx latency at low packet-receiving rates and improve throughput at high packet-receiving rates. Some NIC drivers do not support this feature.
adaptive-tx - An algorithm to improve tx latency at low packet-sending rates and improve throughput at high packet-sending rates. Some NIC drivers do not support this feature.
pkt-rate-low - Rate of packets-per-second below which a different set of *-usecs and *-max-frames parameters are used:
rx-usecs-low
rx-max-frames-low
tx-usecs-low
tx-max-frames-low
Above this rate, the normal *-usecs and *-max-frames parameters are used.
pkt-rate-high - Rate of packets-per-second above which a different set of *-usecs and *-max-frames parameters are used:
rx-usecs-high
rx-max-frames-high
tx-usecs-high
tx-max-frames-high
Below this rate, the normal *-usecs and *-max-frames parameters are used.
sample-interval - Number of seconds to use as packet sampling rate for adaptive coalescing. Must be non-zero.
Some applications generate small, rapid-fire packets that can increase DUP ACKs in congested or high-latency environments. Where possible, optimize applications to batch sends or reduce burstiness.
check how many packets are being dropped per iface
column -t /proc/net/dev
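To pull just the drop counters out of that table, a small sketch. It assumes the usual /proc/net/dev layout (two header lines, then one row per interface with the receive columns followed by the transmit columns); iface_drops is a hypothetical helper name.

```shell
# Print RX and TX drop counters per interface from a /proc/net/dev style file.
iface_drops() {
    awk 'NR > 2 {
        sub(/:/, " ")                  # split "em1:" from the first counter
        print $1, "rx_drop=" $5, "tx_drop=" $13
    }' "${1:-/proc/net/dev}"
}

if [ -r /proc/net/dev ]; then
    iface_drops
fi
```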
TCP Receive Queue and netdev_max_backlog
Each CPU core can hold a number of packets in a ring buffer before the network stack is able to process them. If the buffer is filled faster than TCP stack can process a packet, a dropped packet counter is incremented and the packet is dropped. The net.core.netdev_max_backlog setting should be increased to maximize the number of packets queued for processing on servers with high burst traffic.
net.core.netdev_max_backlog is a per CPU core setting.
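The dropped packet counter mentioned above can be read from /proc/net/softnet_stat, which has one row per CPU with hexadecimal columns; the second column counts packets dropped because the backlog was full. The sketch below sums that column. The name softnet_dropped is hypothetical.

```shell
# Sum the per-CPU "dropped" column (2nd field, hex) of a softnet_stat style file.
softnet_dropped() {
    total=0
    while read -r processed dropped rest; do
        total=$((total + 0x$dropped))
    done < "${1:-/proc/net/softnet_stat}"
    echo "$total"
}

if [ -r /proc/net/softnet_stat ]; then
    echo "netdev_max_backlog: $(cat /proc/sys/net/core/netdev_max_backlog)"
    echo "backlog drops since boot: $(softnet_dropped)"
fi
```

If the total keeps growing between samples while traffic is bursty, that is the signal to raise net.core.netdev_max_backlog.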
TCP Backlog Queue and tcp_max_syn_backlog
The TCP Backlog Queue holds incomplete connections waiting to complete.
A connection is created for any SYN packets that are picked up from the receive queue and are moved to the SYN Backlog Queue. The connection is marked “SYN_RECV” and a SYN+ACK is sent back to the client.
These connections are not moved to the accept queue until the corresponding ACK is received and processed.
The maximum number of connections in the queue is set in the net.ipv4.tcp_max_syn_backlog kernel setting.
Under normal load, the number of SYN backlog entries should be no higher than 1, and it should remain below the tcp_max_syn_backlog limit under heavy load. To check the current size of a TCP port’s SYN backlog, run the following command (example uses TCP port 80):
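A minimal sketch of such a check, assuming the ss tool from iproute2 is installed; syn_recv_count is a hypothetical wrapper name. It counts sockets in the SYN_RECV state whose local port matches.

```shell
# Count sockets in SYN_RECV on a local TCP port (default 80).
# -H suppresses the header line so every output line is one socket.
syn_recv_count() {
    ss -H -n state syn-recv sport = ":${1:-80}" | awk 'END { print NR }'
}

# Usage:
#   syn_recv_count 80
#   cat /proc/sys/net/ipv4/tcp_max_syn_backlog   # the limit to compare against
```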
If there are a high number of connections in the “SYN_RECV” state, this can cause problems on a server taking high volume traffic. Before increasing this limit, it may be possible to reduce the time a SYN packet sits in this queue by tuning related TCP settings.
SYN Cookies
Tuning this can decrease the duration a SYN packet sits in the receive queue. If SYN cookies are not enabled, the client will simply retry sending a SYN packet.
If SYN cookies are enabled (net.ipv4.tcp_syncookies), the connection is not created and is not placed in the SYN backlog, but a SYN+ACK packet is sent to the client as if it was. SYN cookies may be beneficial under normal traffic, but during high volume burst traffic some connection details will be lost and the client will experience issues when the connection is established. There’s a bit more to it than just the SYN cookies, but here’s a write up called “SYN cookies ate my dog” written by Graeme Cole that explains in detail why enabling SYN cookies on high performance servers can cause issues.
SYN+ACK Retries
Tuning this can significantly decrease the duration a SYN packet sits in the receive queue. What happens when a SYN+ACK is sent but never gets a response ACK packet? In this case, the network stack on the server will retry sending the SYN+ACK. The delay between attempts is calculated to allow for server recovery.
If the server receives a SYN, sends a SYN+ACK, and does not receive an ACK, the delay before each retry follows the exponential backoff algorithm and therefore depends on how many attempts have already been made.
The kernel setting that defines the number of SYN+ACK retries is net.ipv4.tcp_synack_retries with a default setting of 5.
This will retry at the following times after the first attempt: 1s, 3s, 7s, 15s, 31s. The last retry times out roughly 63s after the first attempt was made, which corresponds to when the next attempt would have been made if the number of retries were 6.
This alone can keep a SYN packet in the SYN backlog for more than 60 seconds before the packet times out. If the SYN backlog queue is small, it doesn’t take a large volume of connections to cause an amplification event in the network stack where half-open connections never complete and no connections can be established. Set the number of SYN+ACK retries to 0 or 1 to avoid this behavior on high performance servers.
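The retry schedule above can be reproduced with a few lines of arithmetic. This sketch assumes a 1s initial timeout that doubles after every attempt and ignores any upper cap on the timeout; synack_schedule is a hypothetical helper name.

```shell
# Print the seconds (after the first SYN+ACK) at which each retry is sent,
# followed by the moment the kernel gives up on the half-open connection.
synack_schedule() {
    t=0; rto=1; i=0; out=""
    while [ "$i" -le "$1" ]; do
        t=$((t + rto))
        out="$out${out:+ }$t"
        rto=$((rto * 2))
        i=$((i + 1))
    done
    echo "$out"
}

synack_schedule 5   # prints: 1 3 7 15 31 63
synack_schedule 1   # prints: 1 3
```

With one retry, a half-open connection leaves the SYN backlog after about 3s instead of about 63s.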
SYN Retries
Tuning this can significantly decrease the duration a SYN packet sits in the receive queue. Although SYN retries refer to the number of times a client will retry sending a SYN while waiting for a SYN+ACK, it can also impact high performance servers that make proxy connections.
An nginx server making a few dozen proxy connections to a backend server due to a spike of traffic can overload the backend server’s network stack for a short period, and retries can create an amplification on the backend on both the receive queue and the SYN backlog queue. This, in turn, can impact the client connections being served. The kernel setting for SYN retries is net.ipv4.tcp_syn_retries and defaults to 5 or 6 depending on distribution. Rather than retry for upwards of 63–130s (exponential backoff), limit the number of SYN retries to 1, which is the lowest value this setting accepts.
TCP Accept Queue and somaxconn
Applications are responsible for creating their accept queue when opening a listener port: the “backlog” parameter passed to listen() sets its size. As of linux kernel v2.2, this parameter changed from setting the maximum number of incomplete connections a socket can hold to the maximum number of completed connections waiting to be accepted. As described above, the maximum number of incomplete connections is now set with the kernel setting net.ipv4.tcp_max_syn_backlog.
The TCP listen() backlog
Although the application is responsible for the accept queue size on each listener it opens, there is a limit to the number of connections that can be in the listener’s accept queue. There are two settings that control the size of the queue:
A backlog parameter on the TCP listen() call made from the application
A kernel limit maximum from the kernel sysctl: net.core.somaxconn
Accept Queue Default
The default value for net.core.somaxconn comes from the SOMAXCONN constant, which is set to 128 on linux kernels up through v5.3, while SOMAXCONN was raised to 4096 in v5.4. However, v5.4 is the most current version at the time of this writing and has not been widely adopted yet, so the accept queue is going to be truncated to 128 on many production systems that have not modified net.core.somaxconn.
Applications typically use the value of the SOMAXCONN constant when configuring the default backlog for a listener if it is not set in the application configuration, or it’s sometimes simply hard-coded in the server software. Some applications set their own default, like nginx which sets it to 511 — which is silently truncated to 128 on linux kernels through v5.3. Check the application documentation for configuring the listener to see what is used.
To check the accept() queue size that is configured for open TCP listener ports, run the following command (example port 80):
ss -plnt sport = :80 | cat
Accept Queue Maximum
The maximum value for the net.core.somaxconn is 65535 in kernels v2.2 through v4.0.x, and 4294967295 in kernels v4.1.0+.
Accept Queue Override
Many applications allow the accept queue size to be specified in the configuration by providing a “backlog” value on the listener directive or a configuration that will be used when calling listen(). For example, nginx has a backlog parameter that can be added to the listen directive to adjust the size of the accept queue for the listener port.
If an application calls listen() with a backlog value larger than net.core.somaxconn, then the backlog for that listener will be silently truncated to the somaxconn value.
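The truncation rule is simply the smaller of the two values. A sketch, with effective_backlog as a hypothetical helper that falls back to the live somaxconn value when no limit is passed:

```shell
# Effective accept queue size: min(listen() backlog, net.core.somaxconn).
effective_backlog() {
    limit=${2:-$(cat /proc/sys/net/core/somaxconn)}
    if [ "$1" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$1"
    fi
}

effective_backlog 511 128     # prints: 128  (nginx default, somaxconn 128)
effective_backlog 511 4096    # prints: 511  (nginx default, somaxconn 4096)
```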
Application Workers
If the accept queue is large, also consider increasing the number of threads that can handle accepting requests from the queue in the application. For example, setting a backlog of 20480 on an HTTP listener for a high volume nginx server without allowing for enough worker_connections to manage the queue will cause connection refused responses from the server.
File Descriptors (file handles, connections)
On linux systems, everything is a file. This includes actual files and folders, symlinks, pipes, and sockets, among others. Because of this, configuring the maximum number of connections for a process also requires configuring the number of files a process can open.
Every socket in a connection also uses a file descriptor.
----------------------------------------------------------------------
Open Files System Limit
The maximum number of all file handles that can be allocated to the system is set with the kernel setting fs.file-max.
The fs.file-max setting is the total maximum number of file handles that can be allocated and used on a system.
To see the current number of file descriptors allocated and the max allowed, cat the following file:
# cat /proc/sys/fs/file-nr
1976 0 2048
The output shows that the number of file descriptors in use is 1976, the number of allocated but free file descriptors is 0 (this will always show “0” on kernel v2.6+ meaning used and allocated always match), and the maximum is 2048. On a high performance system, this should be set high enough to handle the maximum number of connections and any other file descriptor needs for all processes on the system. 2048 is very low for this kind of system, and 1976 is very close to the maximum.
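For monitoring, the same three fields can be turned into a usage percentage. A sketch, with file_handle_usage as a hypothetical helper name:

```shell
# Print allocated handles, the system maximum, and the percentage in use,
# from a file in /proc/sys/fs/file-nr format (allocated, free, max).
file_handle_usage() {
    awk '{
        pct = ($3 > 0) ? int(100 * $1 / $3) : 0
        print "allocated=" $1, "max=" $3, "used=" pct "%"
    }' "${1:-/proc/sys/fs/file-nr}"
}

if [ -r /proc/sys/fs/file-nr ]; then
    file_handle_usage
fi
```

For the example values 1976 0 2048 this reports used=96%.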
Open Files Process Limit
The maximum number of files that can be opened by a single process is governed by the kernel setting fs.nr_open. This setting should be no larger than one third of fs.file-max. By default, fs.nr_open should be large enough for any single process running on a system without needing to adjust it.
The fs.nr_open setting is the maximum value that can be set for the “number of open files”, or nofile, user limit.
Open Files User Limit
In addition to the file descriptor system and process limits, each user is limited to a maximum number of open file descriptors. This is set with the system’s limits.conf (nofile), or in the process’s systemd unit file if running a process under systemd (LimitNOFILE). To see the maximum number of file descriptors a user can have open by default:
ulimit -n
1024
And under systemd, using nginx as an example:
systemctl show nginx | grep LimitNOFILE
4096
There are many guides available to explain how to make these settings work for a process that needs many file descriptors. This is a detailed approach that has worked on high volume systems and should work for any system.
1. Configure the Open Files System Limit
Select a system limit that will accommodate the total number of open files needed on the system: multiply the number of open files needed by a single workload process by the number of processes expected to run. Set the fs.file-max kernel setting to this value, plus some buffer. Example: a system is running 4 processes that each require 800,000 open files, so a base value of 3200000 plus a buffer of 200000 can be used if the setting isn’t already set high enough.
fs.file-max = 3400000 # (800000 * 4) + 200000
2. Configure the Open Files Process Limit
Select a process limit to accommodate the highest number of open files needed for a single workload process. Example, the workload processes require a maximum of 800,000 open files:
fs.nr_open = 801000
3. Configure the Open Files User Limit
To adjust the user limit to take advantage of the system limits, set the nofile value to the maximum number of open files needed for connection sockets on all listeners plus any other file descriptor needs of the worker processes, and include some buffer. User limits are set under /etc/security/limits.conf, or a conf file under /etc/security/limits.d/, or in the systemd unit file for the service. Example:
# cat /etc/security/limits.d/nginx.conf
nginx soft nofile 800000
nginx hard nofile 800000
# cat /lib/systemd/system/nginx.service
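Limits in limits.conf are applied by PAM at login, and unit file limits only take effect after the service is restarted, so it is worth checking what a running process actually received. A sketch that reads the kernel's per-process view; nofile_limits is a hypothetical helper name.

```shell
# Print the soft and hard "Max open files" limits from a /proc/<pid>/limits file.
nofile_limits() {
    awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }' \
        "${1:-/proc/self/limits}"
}

if [ -r /proc/self/limits ]; then
    nofile_limits          # limits inherited from the current shell
fi

# For a running service, point it at that process instead, for example:
#   nofile_limits "/proc/$(cat /var/run/nginx.pid)/limits"
```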
Like file descriptor limits, the number of workers, or threads, that a process can create is limited by both a kernel setting and a user limit.
Threads System Limit
Processes can spin up worker threads. The maximum number of all threads that can be created is set with the kernel setting kernel.threads-max. To see the max number of threads along with the current number of threads executing on a system, run the following commands:
Get current max threads:
cat /proc/sys/kernel/threads-max
The default is computed at boot from the amount of RAM, so that the thread structures would occupy at most one eighth of the available memory pages even if the maximum number of threads were created.
Total threads running:
$ ps -eo nlwp | awk '$1 ~ /^[0-9]+$/ { n += $1 } END { print n }'
As long as the total number of threads is lower than the max, the server will be able to create new threads for processes as long as they’re within user limits.
Threads Process Limit
Unlike kernel settings for open files limits, there is no direct process limit setting for threads. This is handled indirectly by the kernel.
A setting that can impact the number of threads that can be forked is kernel.pid_max. This will set the maximum number of threads that can execute simultaneously by limiting the number of process IDs that are available. Increasing this will allow the system to execute more threads concurrently.
Another setting is vm.max_map_count. This controls the number of memory map areas a process may have; each thread's stack adds mappings to its process, so a low value can cap thread creation. A general rule of thumb is to increase this to double the number of expected concurrent threads on a system.
Threads User Limit
In addition to the max threads system limit, each user process is limited to a maximum number of threads. This is again set with the system’s limits.conf (nproc), or in the process’s systemd unit file if running a process under systemd (LimitNPROC). To see the maximum number of threads a process can fork():
ulimit -u
4096
And under systemd, using nginx as an example:
systemctl show nginx | grep LimitNPROC
4096
Updating the Thread Settings to Required Values
In most systems, the system limit is already set high enough to handle the number of threads a high performance server needs. However, to adjust the system limit, set the kernel.threads-max kernel setting to the maximum number of threads the system needs, plus some buffer. Example:
kernel.threads-max = 3261780
To adjust the user limit, set the value high enough for the number of worker threads needed to handle the volume of traffic including some buffer. As with nofile, the nproc user limits are set under /etc/security/limits.conf, or a conf file under /etc/security/limits.d/, or in the systemd unit file for the service. Example, with nproc and nofile:
# cat /etc/security/limits.d/nginx.conf
nginx soft nofile 800000
nginx hard nofile 800000
nginx soft nproc 800000
nginx hard nproc 800000
# cat /lib/systemd/system/nginx.service
[Unit]
Description=OpenResty Nginx - high performance web server
Documentation=https://www.nginx.org/en/docs/
After=network-online.target remote-fs.target nss-lookup.target
Wants=network-online.target
[Service]
Type=forking
LimitNOFILE=800000
LimitNPROC=800000
PIDFile=/var/run/nginx.pid
ExecStart=/usr/local/openresty/nginx/sbin/nginx -c /usr/local/openresty/nginx/conf/nginx.conf
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
[Install]
WantedBy=multi-user.target
TCP Reverse Proxy Connections in TIME_WAIT
Under high volume burst traffic, proxy connections stuck in “TIME_WAIT” can add up, tying up many resources during the close connection handshake. This state indicates the client has received a final FIN packet from the server (or upstream worker) and the socket is being kept around to allow any delayed in-flight packets to be properly handled.
The time the connection exists in “TIME_WAIT” is defined as 2 x MSL (Maximum Segment Lifetime); Linux uses a fixed interval of 60s. In many cases, this is normal and expected behavior and the default is acceptable. However, when the volume of connections in the “TIME_WAIT” state is high, this can cause the application to run out of ephemeral ports for new outgoing connections. In this case, shorten the close handshake by reducing the FIN timeout.
The kernel setting that controls this timeout is net.ipv4.tcp_fin_timeout and a good setting for a high performance server is between 5 and 7 seconds. Strictly speaking, this setting limits how long an orphaned connection waits in FIN_WAIT_2; the 60s TIME_WAIT interval itself is not tunable through sysctl.
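To judge whether ephemeral port exhaustion is a real risk, compare the number of sockets in TIME_WAIT with the size of the local port range. A sketch assuming ss from iproute2 is available; both function names are hypothetical. The port range applies per destination address and port, so this comparison is a rough upper bound rather than an exact limit.

```shell
# Number of local ports available for outgoing connections.
ephemeral_port_count() {
    read -r low high < "${1:-/proc/sys/net/ipv4/ip_local_port_range}"
    echo $((high - low + 1))
}

# Number of TCP sockets currently in TIME_WAIT.
time_wait_count() {
    ss -H -tan state time-wait | awk 'END { print NR }'
}

# Usage:
#   echo "TIME_WAIT: $(time_wait_count) of $(ephemeral_port_count) ephemeral ports"
```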
The receive queue should be sized to handle as many packets as linux can process off of the NIC without causing dropped packets, including some small buffer in case spikes are a bit higher than expected. The softnet_stat file should be monitored for dropped packets to discover the correct value. A good rule of thumb is to use the value set for tcp_max_syn_backlog to allow for at least as many SYN packets that can be processed to create half-open connections. Remember, this is the number of packets each CPU can have in its receive buffer, so divide the total desired by the number of CPUs to be conservative.
The SYN backlog queue should be sized to allow for a large number of half-open connections on a high performance server to handle bursts of occasional spike traffic. A good rule of thumb is to set this at least to the highest number of established connections a listener can have in the accept queue, but no higher than twice the number of established connections a listener can have. It is also recommended to turn off SYN cookie protection on these systems to avoid data loss on high burst initial connections from legitimate clients.
The accept queue should be sized to allow for holding a volume of established connections waiting to be processed as a temporary buffer during periods of high burst traffic. A good rule of thumb is to set this between 20–25% of the number of worker threads.
Configurations
The following kernel settings were discussed in this article, using nginx.
# /etc/sysctl.d/99-nginx.conf
# /proc/sys/fs/file-max
# Maximum number of file handles that can be allocated.
# aka: open files.
# NOTES
# - This should be sized to accommodate the number of connections
# (aka: file handles or open files) needed by all processes.
# RECOMMENDATION
# - Increase this setting if more high connection processes are
# started.
# SEE ALSO
# - /proc/sys/fs/file-nr
fs.file-max = 3400000
# /proc/sys/fs/nr_open
# Maximum number of file handles that a single process can
# allocate, aka: open files or connections.
# NOTES
# - Each process requires a high number of connections to operate.
# RECOMMENDATION
# - None
# SEE ALSO
# - net.core.somaxconn
# - user limits: nofile
fs.nr_open = 801000
# /proc/sys/net/core/somaxconn
# Accept Queue Limit, maximum number of established connections
# waiting for accept() per listener.
# NOTES
# - Maximum size of accept() for each listener.
# - Do not size this less than net.ipv4.tcp_max_syn_backlog
# SEE ALSO
# net.ipv4.tcp_max_syn_backlog
net.core.somaxconn = 65535
# /proc/sys/net/ipv4/tcp_max_syn_backlog
# SYN Backlog Queue, number of half-open connections
# NOTES
# - Example server: 8 cores, can handle over 65535 total half-open
# connections.
# - Do not size this more than net.core.somaxconn
# SEE ALSO
# - net.core.netdev_max_backlog
# - net.core.somaxconn
net.ipv4.tcp_max_syn_backlog = 65535
# /proc/sys/net/core/netdev_max_backlog
# Receive Queue Size per CPU Core, number of packets.
# NOTES
# - Example server: 8 cores, each core should at least be able to
# receive 1/8 of the tcp_max_syn_backlog.
# RECOMMENDATION
# - Size this to be double the number needed; in the example, 1/4.
# SEE ALSO
# - net.ipv4.tcp_max_syn_backlog
net.core.netdev_max_backlog = 16386
# /proc/sys/net/ipv4/tcp_syn_retries
# /proc/sys/net/ipv4/tcp_synack_retries
# Maximum number of SYN and SYN+ACK retries before packet
# expires.
# NOTES
# - Reduces connection time to fail
net.ipv4.tcp_syn_retries = 1
net.ipv4.tcp_synack_retries = 1
# /proc/sys/net/ipv4/tcp_fin_timeout
# Timeout in seconds to close client connections in TIME_WAIT
# after receiving FIN packet.
# NOTES
# - Improves socket availability performance, allows for closed
# connections to be reused more quickly.
net.ipv4.tcp_fin_timeout = 5
# /proc/sys/net/ipv4/tcp_syncookies
# Disable SYN cookie flood protection.
# NOTES
# - Only disable this on systems that require a high volume of
# legal connections in a short amount of time, ie: bursts.
net.ipv4.tcp_syncookies = 0
# /proc/sys/kernel/threads-max
# Maximum number of threads system can have, total.
# NOTES
# - Commented, may not be needed; check system.
# SEE ALSO
# - user limits.
#kernel.threads-max = 3261780
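Files under /etc/sysctl.d/ are loaded at boot, or on demand with sysctl --system (as root). The sketch below is one way to confirm that the running kernel matches a file such as the one above. Both helper names are hypothetical, and the parser assumes simple "key = value" lines with # comments.

```shell
# List "key value" pairs from a sysctl.d style file, skipping comments and blanks.
sysctl_conf_pairs() {
    sed -e 's/#.*$//' -e '/^[[:space:]]*$/d' "$1" |
        awk -F'=' '{
            key = $1; val = $2
            gsub(/[ \t]/, "", key)
            sub(/^[ \t]+/, "", val); sub(/[ \t]+$/, "", val)
            print key, val
        }'
}

# Report every setting whose live value under /proc/sys differs from the file.
sysctl_conf_diff() {
    sysctl_conf_pairs "$1" | while read -r key want; do
        path="/proc/sys/$(printf '%s' "$key" | tr . /)"
        if [ -r "$path" ]; then
            have=$(tr -s '\t' ' ' < "$path")
        else
            have="unreadable"
        fi
        [ "$have" = "$want" ] || echo "$key: file=$want live=$have"
    done
}

# Usage:
#   sysctl --system
#   sysctl_conf_diff /etc/sysctl.d/99-nginx.conf
```

No output from sysctl_conf_diff means every listed setting is active.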
The following user limit settings were discussed in this article:
# /etc/security/limits.d/nginx.conf
nginx soft nofile 800000
nginx hard nofile 800000
nginx soft nproc 800000
nginx hard nproc 800000