How to handle peer death in a TCP connection

Introduction

After the three-way handshake, a TCP connection stays established until it is closed normally (a FIN/ACK exchange in each direction) or aborted because of an error. Errors include network failures and peer death.

Consider a simple TCP connection between Peer A and Peer B. It starts with the three-way handshake: a SYN segment from A to B, a SYN/ACK back from B to A, and a final ACK from A to B. At this point the connection is established and stable, and both sides would normally wait for someone to send data over the channel. Suppose A is waiting for data from B. Now unplug the power supply from B: it shuts down without sending anything over the network to tell A that the connection is about to break. A is still ready to receive data and has no idea that B has crashed. Now restore power to B and wait for the system to restart. A and B are both up again, but while A still believes it has an active connection to B, B knows nothing about it. The situation resolves itself when A tries to send data to B over the dead connection: B replies with an RST packet, which causes A to finally close the connection.

 _____                                                     _____
|     |                                                   |     |
|  A  |                                                   |  B  |
|_____|                                                   |_____|
   ^                                                         ^
   |--->--->--->-------------- SYN -------------->--->--->---|
   |---<---<---<------------ SYN/ACK ------------<---<---<---|
   |--->--->--->-------------- ACK -------------->--->--->---|
   |                                                         |
   |                                       system crash ---> X
   |
   |                                     system restart ---> ^
   |                                                         |
   |--->--->--->-------------- PSH -------------->--->--->---|
   |---<---<---<-------------- RST --------------<---<---<---|
   |                                                         |
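From A's side, the RST at the end of this exchange shows up as a connection-reset error on the socket. The sketch below imitates that on the loopback interface. It does not crash a real host: as a stand-in, the hypothetical peer_b thread closes its socket with SO_LINGER set to a zero timeout, which on Linux makes close() send an RST instead of a FIN. The function names and the loopback setup are assumptions made for illustration.

```python
import socket
import struct
import threading


def abortive_close(conn):
    """Close conn so that the kernel sends RST instead of FIN (stand-in for a crash)."""
    # struct linger { int l_onoff; int l_linger; } with l_onoff=1, l_linger=0
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()


def demo_reset():
    """Return "reset" if peer A observes the connection being reset by peer B."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)

    def peer_b():
        conn, _ = listener.accept()
        abortive_close(conn)

    worker = threading.Thread(target=peer_b)
    worker.start()

    peer_a = socket.create_connection(listener.getsockname(), timeout=5)
    worker.join()  # B has sent its RST by now
    try:
        peer_a.recv(1)
        return "no error"
    except (ConnectionResetError, BrokenPipeError):
        # This is the point where A learns the connection is dead and cleans up.
        return "reset"
    finally:
        peer_a.close()
        listener.close()
```

In a real application the except branch is where A would release resources tied to the connection and, if appropriate, reconnect.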

Dead peer detection methods

TCP keepalive

When keepalive is enabled, TCP sends a probe segment carrying no data on an idle connection and expects an ACK in reply. Keepalive can therefore tell you when the peer becomes unreachable, with a low risk of false positives: if the problem is in the network between the two peers, keepalive waits some time and then retries, sending several probes before marking the connection as broken.

Alternative approach

An application using a TCP connection can periodically send its own keep-alive message. The method is similar to TCP keepalive, but it runs in the application layer.
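A minimal sketch of such an application-level keep-alive follows. The PING/PONG message format and the function names are hypothetical, chosen only for illustration; a real protocol would define its own heartbeat message and proper framing.

```python
import socket

PING = b"PING\n"  # hypothetical heartbeat request
PONG = b"PONG\n"  # hypothetical heartbeat reply


def probe_peer(sock, timeout=5.0):
    """Send one heartbeat and wait for the reply.

    Returns True if the peer answered in time, False on timeout or socket error.
    """
    sock.settimeout(timeout)
    try:
        sock.sendall(PING)
        reply = sock.recv(len(PONG))
    except OSError:  # covers timeouts, connection resets, and broken pipes
        return False
    return reply == PONG


def answer_probe(sock):
    """Peer side: read one heartbeat request and answer it."""
    if sock.recv(len(PING)) == PING:
        sock.sendall(PONG)
```

The caller would run probe_peer on a timer while the connection is idle and treat a few consecutive failures, rather than a single one, as peer death, mirroring the retry behaviour of TCP keepalive.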

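Why is TCP keep-alive not part of the TCP standard?

TCP is perfectly happy to allow both devices to stop transmitting for a very long period of time, and then simply resume transmitting data and acknowledgment segments when either has data to send. Probing an idle connection is therefore not needed for correctness, and RFC 1122 treats keepalive as an optional feature that must default to off.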

Using TCP keepalive under Linux

Linux has built-in support for keepalive. To modify the idle time, the probe interval, or the number of probes, write values to the /proc filesystem, for example:

echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
echo 60 > /proc/sys/net/ipv4/tcp_keepalive_intvl
echo 20 > /proc/sys/net/ipv4/tcp_keepalive_probes
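With these settings the kernel waits tcp_keepalive_time seconds of idleness before the first probe, then sends a probe every tcp_keepalive_intvl seconds, and gives up after tcp_keepalive_probes unanswered probes. As a rough worked example of how long a dead peer can go unnoticed on an idle connection (the helper name is hypothetical, and the result is an approximation):

```python
def worst_case_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Approximate time from the last traffic until the kernel drops the connection."""
    return keepalive_time + keepalive_intvl * keepalive_probes


# Values from the commands above: 600 + 60 * 20
print(worst_case_detection_seconds(600, 60, 20))  # 1800 seconds, i.e. 30 minutes
```

Shorter values detect a dead peer sooner, at the cost of more probe traffic and less tolerance for temporary network outages.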

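Note that these values are system-wide defaults for all keepalive-enabled sockets. Keepalive itself must be switched on per socket with the SO_KEEPALIVE option, and the defaults can be overridden per socket through setsockopt() with the TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT options.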
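A per-socket override might look like the sketch below. It uses Python's socket module, which exposes the same setsockopt options as the C API on Linux; the function name and the default values are assumptions chosen to match the /proc example.

```python
import socket


def enable_keepalive(sock, idle=600, interval=60, probes=20):
    """Turn on TCP keepalive for one socket and override the system-wide defaults."""
    # Keepalive is off by default and must be enabled per socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The next three options are Linux-specific (see tcp(7)).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # idle seconds before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)      # unanswered probes before giving up
```

Once the probes are exhausted, the next read or write on the socket fails with a timeout error, which is where the application should clean up the connection.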

Troubleshooting

A TCP keepalive probe is sent with a sequence number one less than the sequence number the receiver is expecting. Because the receiver has already ACKed that sequence number (it fell within the range of an earlier segment), it simply ACKs it again and discards the segment.
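When reading a capture, the rule above can be turned into a small check. The sketch below is an illustration with hypothetical parameter names, not Wireshark's actual code; it roughly follows the heuristic Wireshark documents for its keep-alive flag: a payload of zero or one byte, a sequence number one less than the next expected one, and none of SYN, FIN, or RST set.

```python
SEQ_MODULUS = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def looks_like_keepalive(seq, payload_len, next_expected_seq,
                         syn=False, fin=False, rst=False):
    """Return True if a segment matches the keepalive pattern described above."""
    if syn or fin or rst:
        return False
    if payload_len not in (0, 1):
        return False
    return seq == (next_expected_seq - 1) % SEQ_MODULUS
```

For instance, if the receiver expects sequence number 1000 next, an empty segment with sequence number 999 matches, while one with sequence number 1000 does not.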

[Screenshot: a TCP keepalive packet in Wireshark]

References

http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html

http://man7.org/linux/man-pages/man7/tcp.7.html

http://stackoverflow.com/questions/5435098/how-to-use-so-keepalive-option-properly-to-detect-that-the-client-at-the-other-e

https://ask.wireshark.org/questions/44609/wireshark-tcp-keep-alive-detection

http://stackoverflow.com/questions/5855774/how-can-i-figure-out-if-a-packet-is-a-tcp-keep-alive