TCP troubleshooting

Troubleshooting TCP flow


Dropped RX packets

Check how many packets are being dropped per interface:

column -t /proc/net/dev

Source: https://levelup.gitconnected.com/linux-kernel-tuning-for-high-performance-networking-high-volume-incoming-connections-196e863d458a


TCP Receive Queue and netdev_max_backlog

Each CPU core can hold a number of packets in a ring buffer before the network stack is able to process them. If the buffer is filled faster than the TCP stack can process packets, a drop counter is incremented and packets are dropped. Increase the net.core.netdev_max_backlog setting to maximize the number of packets queued for processing on servers with high burst traffic.

net.core.netdev_max_backlog is a per-CPU-core setting.
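
As a quick check, drops at this layer show up per CPU in the second column of /proc/net/softnet_stat (hex values, one row per CPU), complementing the per-interface counters in /proc/net/dev. A minimal sketch, assuming GNU awk for the hex conversion:

$ awk '{ printf "cpu%d dropped=%d\n", NR-1, strtonum("0x" $2) }' /proc/net/softnet_stat

If the drop counters are climbing, the backlog can be raised at runtime with, for example, sysctl -w net.core.netdev_max_backlog=16386 (persist it under /etc/sysctl.d/ as shown in the Configurations section at the end).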


TCP Backlog Queue and tcp_max_syn_backlog

The TCP Backlog Queue holds incomplete connections waiting to complete.

A connection is created for any SYN packet that is picked up from the receive queue; it is moved to the SYN backlog queue, marked "SYN_RECV", and a SYN+ACK is sent back to the client.

These connections are not moved to the accept queue until the corresponding ACK is received and processed.

The maximum number of connections in the queue is set in the net.ipv4.tcp_max_syn_backlog kernel setting.

Under normal load, the number of SYN backlog entries should be no higher than 1, and it should remain below the tcp_max_syn_backlog limit under heavy load. To check the current size of a TCP port's SYN backlog, run the following command (this example uses TCP port 80):

ss -n state syn-recv sport = :80 | wc -l

A high number of connections in the "SYN_RECV" state can cause problems on a server taking high-volume traffic. Before increasing this limit, it may be possible to reduce the time a SYN packet sits in this queue by tuning related TCP settings.
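
If the backlog does need to grow, the limit can be raised at runtime (example value, matching the Configurations section at the end):

# sysctl -w net.ipv4.tcp_max_syn_backlog=65535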


SYN Cookies

Tuning this can decrease the duration a SYN packet sits in the SYN backlog queue. If SYN cookies are not enabled and the backlog is full, the client will simply retry sending a SYN packet. If SYN cookies are enabled (net.ipv4.tcp_syncookies), the connection is not created and is not placed in the SYN backlog, but a SYN+ACK packet is sent to the client as if it were. SYN cookies may be beneficial under normal traffic, but during high-volume burst traffic some connection details will be lost and the client will experience issues when the connection is established. There's more to it than just the SYN cookies; a write-up called "SYN cookies ate my dog" by Graeme Cole explains in detail why enabling SYN cookies on high-performance servers can cause issues.
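
To check whether SYN cookies are currently enabled (1 means enabled, which is the default on most distributions):

$ sysctl net.ipv4.tcp_syncookies
net.ipv4.tcp_syncookies = 1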

SYN+ACK Retries

Tuning this can significantly decrease the duration a SYN packet sits in the SYN backlog queue. What happens when a SYN+ACK is sent but never gets a response ACK packet? In this case, the network stack on the server retries sending the SYN+ACK. The delay between attempts is calculated to allow for server recovery.

If the server receives a SYN, sends a SYN+ACK, and does not receive an ACK, the time until the next retry follows the exponential backoff algorithm and therefore depends on the retry counter.

The kernel setting that defines the number of SYN+ACK retries is net.ipv4.tcp_synack_retries, with a default of 5. The retries fire at the following intervals after the first attempt: 1s, 3s, 7s, 15s, 31s. The last retry times out roughly 63s after the first attempt was made, which corresponds to when the next attempt would have been made if the number of retries were 6. This alone can keep a SYN packet in the SYN backlog for more than 60 seconds before it times out. If the SYN backlog queue is small, it doesn't take a large volume of connections to cause an amplification event in the network stack where half-open connections never complete and no new connections can be established. Set the number of SYN+ACK retries to 0 or 1 to avoid this behavior on high-performance servers.
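
To see where the ~63s comes from: each retransmission doubles the previous delay, so retry n fires (2^n - 1) seconds after the first attempt. A shell sketch of the arithmetic (not a kernel query):

$ for n in 1 2 3 4 5 6; do echo "retry $n: $(( (1 << n) - 1 ))s after first attempt"; done
retry 1: 1s after first attempt
retry 2: 3s after first attempt
retry 3: 7s after first attempt
retry 4: 15s after first attempt
retry 5: 31s after first attempt
retry 6: 63s after first attempt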

SYN Retries

Tuning this can significantly decrease the duration a SYN packet sits in the receive queue. Although SYN retries refers to the number of times a client retries sending a SYN while waiting for a SYN+ACK, it can also impact high-performance servers that make proxy connections. An nginx server making a few dozen proxy connections to a backend server due to a traffic spike can overload the backend server's network stack for a short period, and retries can create an amplification on the backend in both the receive queue and the SYN backlog queue. This, in turn, can impact the client connections being served. The kernel setting for SYN retries is net.ipv4.tcp_syn_retries; it defaults to 5 or 6 depending on the distribution. Rather than retry for upwards of 63–130s (exponential backoff), limit the number of SYN retries to 0 or 1.
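
To apply the reduced retry counts at runtime (persist them under /etc/sysctl.d/ as in the Configurations section):

# sysctl -w net.ipv4.tcp_syn_retries=1
# sysctl -w net.ipv4.tcp_synack_retries=1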



TCP Accept Queue and somaxconn

Applications are responsible for creating their accept queue when opening a listener port by calling listen() with a "backlog" parameter. As of Linux kernel v2.2, this parameter changed from setting the maximum number of incomplete connections a socket can hold to the maximum number of completed connections waiting to be accepted. As described above, the maximum number of incomplete connections is now set with the kernel setting net.ipv4.tcp_max_syn_backlog.

The TCP listen() backlog

Although the application is responsible for the accept queue size on each listener it opens, there is a limit to the number of connections that can be in the listener's accept queue. Two settings control the size of the queue: the backlog value passed to listen() and the net.core.somaxconn kernel setting (the effective backlog is the smaller of the two):

Accept Queue Default

The default value for net.core.somaxconn comes from the SOMAXCONN constant, which is set to 128 on Linux kernels up through v5.3; SOMAXCONN was raised to 4096 in v5.4. However, v5.4 is the most current version at the time of this writing and has not been widely adopted yet, so the accept queue is going to be truncated to 128 on many production systems that have not modified net.core.somaxconn.

Applications typically use the value of the SOMAXCONN constant when configuring the default backlog for a listener if it is not set in the application configuration, or it's sometimes simply hard-coded in the server software. Some applications set their own default, like nginx, which sets it to 511 (silently truncated to 128 on Linux kernels through v5.3). Check the application documentation for configuring the listener to see what is used.

To check the accept() queue size that is configured for open TCP listener ports, run the following command (example port 80):
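
One way to do this is with ss; for listening sockets, the Send-Q column shows the configured accept queue size and Recv-Q shows its current occupancy:

$ ss -plnt sport = :80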

Accept Queue Maximum

The maximum value for net.core.somaxconn is 65535 in kernels v2.2 through v4.0.x, and 4294967295 in kernels v4.1.0+.

Accept Queue Override

Many applications allow the accept queue size to be specified in the configuration by providing a "backlog" value on the listener directive or a configuration value that is used when calling listen(). For example, nginx has a backlog parameter on the listen directive that can be used to adjust the size of the accept queue for the listener port:
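
A minimal sketch of the directive (the backlog value shown is illustrative):

# nginx.conf
server {
    listen 80 backlog=20480;
}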


If an application calls listen() with a backlog value larger than net.core.somaxconn, then the backlog for that listener will be silently truncated to the somaxconn value.

Application Workers

If the accept queue is large, also consider increasing the number of threads that can handle accepting requests from the queue in the application. For example, setting a backlog of 20480 on an HTTP listener of a high-volume nginx server without allowing enough worker_connections to manage the queue will cause "connection refused" responses from the server.
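
A sketch of the matching worker settings in nginx (values are illustrative; size worker_connections so the workers can actually drain the backlog):

# nginx.conf
worker_processes auto;
events {
    worker_connections 20480;
}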

File Descriptors (file handles, connections)

On Linux systems, everything is a file. This includes actual files and folders, symlinks, pipes, and sockets, among others. Because of this, configuring the maximum number of connections for a process also requires configuring the number of files a process can open.

Every socket in a connection also uses a file descriptor.

Open Files System Limit

The maximum number of file handles that can be allocated on the system, in total across all processes, is set with the kernel setting fs.file-max.

To see the current number of file descriptors allocated and the max allowed, cat the following file:

# cat /proc/sys/fs/file-nr
1976      0       2048

The output shows that the number of file descriptors in use is 1976, the number of allocated-but-free file descriptors is 0 (this always shows "0" on kernel v2.6+, meaning the used and allocated counts match), and the maximum is 2048. On a high-performance system, this should be set high enough to handle the maximum number of connections and any other file descriptor needs for all processes on the system. 2048 is very low for this kind of system, and 1976 is dangerously close to the maximum.

Open Files Process Limit

The maximum number of files that can be opened by a single process is governed by the kernel setting fs.nr_open. This setting should be no larger than one third of fs.file-max. By default, fs.nr_open should be large enough for any single process running on a system without needing to adjust it.

The fs.nr_open setting is the maximum value that can be set for the "number of open files", or nofile, user limit.
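
To check the current value (1048576, i.e. 1024 * 1024, is the usual default):

$ cat /proc/sys/fs/nr_open
1048576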

Open Files User Limit

In addition to the file descriptor system and process limits, each user is limited to a maximum number of open file descriptors. This is set with the system's limits.conf (nofile), or in the process's systemd unit file if running under systemd (LimitNOFILE). To see the maximum number of file descriptors a user can have open by default:

$ ulimit -n
1024

And under systemd, using nginx as an example:

$ systemctl show nginx | grep LimitNOFILE
LimitNOFILE=4096

Updating the Open File Settings to Required Values

There are many guides that explain how to tune these settings for a file-descriptor-hungry process. The following is a detailed approach that has worked on high-volume systems and should work for any system.

1. Configure the Open Files System Limit

Select a system limit that will accommodate the total number of open files needed on the system. Multiply the number of open files needed by a single workload process by the number of processes expected to run, add some buffer, and set the fs.file-max kernel setting to this value. For example, if a system runs 4 processes that each require 800,000 open files, a value of 3,400,000 can be used if the setting isn't already high enough:

fs.file-max = 3400000 # (800000 * 4) + 200000

2. Configure the Open Files Process Limit

Select a process limit to accommodate the highest number of open files needed by a single workload process. For example, if the workload processes require a maximum of 800,000 open files:

fs.nr_open = 801000

3. Configure the Open Files User Limit

To adjust the user limit to take advantage of the system limits, set the nofile value to the maximum number of open files needed for connection sockets across all listeners, plus any other file descriptor needs of the worker processes, and include some buffer. User limits are set under /etc/security/limits.conf, a conf file under /etc/security/limits.d/, or in the systemd unit file for the service. Example:

# cat /etc/security/limits.d/nginx.conf
nginx soft nofile 800000
nginx hard nofile 800000

# cat /lib/systemd/system/nginx.service
[Unit]
Description=OpenResty Nginx - high performance web server
Documentation=https://www.nginx.org/en/docs/
After=network-online.target remote-fs.target nss-lookup.target
Wants=network-online.target

[Service]
Type=forking
LimitNOFILE=800000
PIDFile=/var/run/nginx.pid
ExecStart=/usr/local/openresty/nginx/sbin/nginx -c /usr/local/openresty/nginx/conf/nginx.conf
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID

[Install]
WantedBy=multi-user.target

Worker Limits (threads/executors)

Like file descriptor limits, the number of workers, or threads, that a process can create is limited by both a kernel setting and a user limit.

Threads System Limit

Processes can spin up worker threads. The maximum number of threads that can be created system-wide is set with the kernel setting kernel.threads-max. To see the max number of threads along with the current number of threads executing on a system, run the following commands:

Get current max threads:

cat /proc/sys/kernel/threads-max

The default is the number of memory pages divided by 4.

Total threads running:

$ ps -eo nlwp | awk '$1 ~ /^[0-9]+$/ { n += $1 } END { print n }'

As long as the total number of threads is lower than the max, the server will be able to create new threads for processes as long as they’re within user limits.

Threads Process Limit

Unlike the kernel settings for open file limits, there is no direct per-process limit setting for threads; this is handled indirectly by the kernel.

A setting that can impact the number of threads that can be forked is kernel.pid_max. This sets the maximum number of threads that can execute simultaneously by limiting the number of process IDs available. Increasing it allows the system to execute more threads concurrently.

Another setting is vm.max_map_count, which controls the maximum number of memory map areas a process can use. A general rule of thumb is to increase this to double the number of expected concurrent threads on a system.
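
For example, to check the current value and apply the doubled rule of thumb for roughly 64k expected concurrent threads (an illustrative number):

$ sysctl vm.max_map_count
vm.max_map_count = 65530
# sysctl -w vm.max_map_count=131072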

Threads User Limit

In addition to the max threads system limit, each user process is limited to a maximum number of threads. This is again set with the system's limits.conf (nproc), or in the process's systemd unit file if running under systemd (LimitNPROC). To see the maximum number of threads a process can fork():

$ ulimit -u
4096

And under systemd, using nginx as an example:

$ systemctl show nginx | grep LimitNPROC
LimitNPROC=4096

Updating the Thread Settings to Required Values

On most systems, the system limit is already high enough to handle the number of threads a high-performance server needs. However, to adjust it, set the kernel.threads-max kernel setting to the maximum number of threads the system needs, plus some buffer. Example:

kernel.threads-max = 3261780

To adjust the user limit, set the value high enough for the number of worker threads needed to handle the volume of traffic, including some buffer. As with nofile, the nproc user limits are set under /etc/security/limits.conf, a conf file under /etc/security/limits.d/, or in the systemd unit file for the service. Example, with nproc and nofile:

# cat /etc/security/limits.d/nginx.conf
nginx soft nofile 800000
nginx hard nofile 800000
nginx soft nproc 800000
nginx hard nproc 800000

# cat /lib/systemd/system/nginx.service
[Unit]
Description=OpenResty Nginx - high performance web server
Documentation=https://www.nginx.org/en/docs/
After=network-online.target remote-fs.target nss-lookup.target
Wants=network-online.target

[Service]
Type=forking
LimitNOFILE=800000
LimitNPROC=800000
PIDFile=/var/run/nginx.pid
ExecStart=/usr/local/openresty/nginx/sbin/nginx -c /usr/local/openresty/nginx/conf/nginx.conf
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID

[Install]
WantedBy=multi-user.target

TCP Reverse Proxy Connections in TIME_WAIT

Under high-volume burst traffic, proxy connections stuck in "TIME_WAIT" can add up, tying up many resources during the connection close handshake. This state indicates the client has received a final FIN packet from the server (or upstream worker), and the connection is kept around so that any delayed in-flight packets can be properly handled. The time a connection stays in "TIME_WAIT" is by default 2 x MSL (Maximum Segment Lifetime), which is 2 x 60s. In many cases this is normal and expected behavior, and the default of 120s is acceptable. However, when the volume of connections in the "TIME_WAIT" state is high, it can cause the application to run out of ephemeral ports for client socket connections. In this case, let these time out faster by reducing the FIN timeout.

The kernel setting that controls this timeout is net.ipv4.tcp_fin_timeout and a good setting for a high performance server is between 5 and 7 seconds.
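
To gauge the problem and apply a shorter timeout at runtime (example value):

$ ss -n state time-wait | wc -l
# sysctl -w net.ipv4.tcp_fin_timeout=5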

Bringing it All Together

The receive queue should be sized to handle as many packets as Linux can process off the NIC without causing dropped packets, including a small buffer in case spikes are a bit higher than expected. The softnet_stat file should be monitored for dropped packets to discover the correct value. A good rule of thumb is to use the value set for tcp_max_syn_backlog, to allow at least as many SYN packets as can be processed to create half-open connections. Remember, this is the number of packets each CPU can have in its receive buffer, so divide the total desired by the number of CPUs to be conservative.

The SYN backlog queue should be sized to allow for a large number of half-open connections on a high performance server to handle bursts of occasional spike traffic. A good rule of thumb is to set this at least to the highest number of established connections a listener can have in the accept queue, but no higher than twice the number of established connections a listener can have. It is also recommended to turn off SYN cookie protection on these systems to avoid data loss on high burst initial connections from legitimate clients.

The accept queue should be sized to hold a volume of established connections waiting to be processed, as a temporary buffer during periods of high burst traffic. A good rule of thumb is to set this to 20–25% of the number of worker threads.

Configurations

The following kernel settings were discussed in this article, using nginx as the example service.

# /etc/sysctl.d/99-nginx.conf

# /proc/sys/fs/file-max
# Maximum number of file handles that can be allocated,
#  aka: open files.
# NOTES
# - This should be sized to accommodate the number of connections
#    (aka: file handles or open files) needed by all processes.
# RECOMMENDATION
# - Increase this setting if more high-connection processes are
#    started.
# SEE ALSO
# - /proc/sys/fs/file-nr
fs.file-max = 3400000

# /proc/sys/fs/nr_open
# Maximum number of file handles that a single process can
#  allocate, aka: open files or connections.
# NOTES
# - Each process requires a high number of connections to operate.
# RECOMMENDATION
# - None
# SEE ALSO
# - net.core.somaxconn
# - user limits: nofile
fs.nr_open = 801000

# /proc/sys/net/core/somaxconn
# Accept Queue Limit, maximum number of established connections
#  waiting for accept() per listener.
# NOTES
# - Maximum size of the accept queue for each listener.
# - Do not size this less than net.ipv4.tcp_max_syn_backlog
# SEE ALSO
# - net.ipv4.tcp_max_syn_backlog
net.core.somaxconn = 65535

# /proc/sys/net/ipv4/tcp_max_syn_backlog
# SYN Backlog Queue, number of half-open connections
# NOTES
# - Example server: 8 cores, can handle over 65535 total half-open
#    connections.
# - Do not size this more than net.core.somaxconn
# SEE ALSO
# - net.core.netdev_max_backlog
# - net.core.somaxconn
net.ipv4.tcp_max_syn_backlog = 65535

# /proc/sys/net/core/netdev_max_backlog
# Receive Queue Size per CPU Core, number of packets.
# NOTES
# - Example server: 8 cores, each core should at least be able to
#    receive 1/8 of the tcp_max_syn_backlog.
# RECOMMENDATION
# - Size this to be double the number needed; in the example, 1/4.
# SEE ALSO
# - net.ipv4.tcp_max_syn_backlog
net.core.netdev_max_backlog = 16386

# /proc/sys/net/ipv4/tcp_syn_retries
# /proc/sys/net/ipv4/tcp_synack_retries
# Maximum number of SYN and SYN+ACK retries before the packet
#  expires.
# NOTES
# - Reduces the time it takes for a connection attempt to fail.
net.ipv4.tcp_syn_retries = 1
net.ipv4.tcp_synack_retries = 1

# /proc/sys/net/ipv4/tcp_fin_timeout
# Timeout in seconds to close client connections in TIME_WAIT
#  after receiving a FIN packet.
# NOTES
# - Improves socket availability, allows closed connections'
#    resources to be reused more quickly.
net.ipv4.tcp_fin_timeout = 5

# /proc/sys/net/ipv4/tcp_syncookies
# Disable SYN cookie flood protection.
# NOTES
# - Only disable this on systems that require a high volume of
#    legitimate connections in a short amount of time, i.e. bursts.
net.ipv4.tcp_syncookies = 0

# /proc/sys/kernel/threads-max
# Maximum number of threads the system can have, total.
# NOTES
# - Commented out; may not be needed, check the system.
# SEE ALSO
# - user limits.
#kernel.threads-max = 3261780

The following user limit settings were discussed in this article:

# /etc/security/limits.d/nginx.conf
nginx soft nofile 800000
nginx hard nofile 800000
nginx soft nproc 800000
nginx hard nproc 800000