Tuning BGP

This article focuses on tuning BGP to deliver large amounts of routing information as efficiently as possible. It covers TCP operation and router queues; BGP update generation is covered in separate articles ([1] and [2]).

TCP protocol considerations

Two main parameters affect TCP performance:

a) Maximum Segment Size (MSS) - controls the size of the TCP segment (or TCP packet)

b) TCP Window Size - controls the rate at which packets can be sent

a) Maximum Segment Size (MSS)

When a TCP session is initiated, the TCP MSS option is carried in the SYN and SYN-ACK messages. Suppose R1 initiates a TCP session to R2: R1 includes MSS=1436 (bytes) in its SYN message, and R2 responds with a SYN-ACK carrying MSS=536 (bytes). The lower of the two MSS values is selected for that TCP session.

RFC 791 mandates that a host must support a datagram size of at least 576 bytes, which ensures that a packet of that size can be sent to the destination without fragmentation. Subtracting the 20-byte TCP header and the 20-byte IPv4 header leaves 536 bytes for the TCP payload, which is why Cisco IOS routers use a default MSS of 536 bytes.

However, a default MSS of 536 bytes drastically reduces performance, because the number of packets required to send a large amount of BGP prefix information increases significantly, and the number of TCP ACK messages increases with it.

If the Maximum Transmission Unit (MTU) is 1500 bytes, the TCP MSS can be set to 1460 bytes [1500 - 20 (IP header) - 20 (TCP header) = 1460]. This improves performance by carrying the same amount of information in far fewer packets and, consequently, with far fewer TCP acknowledgements.
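
On Cisco IOS, the MSS for TCP sessions that the router itself originates or terminates (which includes BGP) can be raised with the global ip tcp mss command. A minimal sketch, assuming a clean 1500-byte MTU along the whole path (the value should match your environment):

Router(config)# ip tcp mss 1460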

b) TCP Window Size

TCP Window Size is the mechanism that controls the rate at which TCP sends packets. It is the amount of data a TCP session can transmit before it must receive a TCP ACK. By default, this value is 16 KB on Cisco routers and can be changed using the ip tcp window-size <value> CLI command.
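
A minimal sketch for raising the window from the 16 KB default (65535 is illustrative; the supported range varies by IOS release):

Router(config)# ip tcp window-size 65535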

Path MTU Discovery

The Path MTU Discovery (PMTUD) feature determines the usable MTU between two nodes. This allows the TCP session to use the largest possible MSS, improving TCP performance for large data transfers without causing IP fragmentation.

Consider the figure below, where R1 initiates a TCP session to R4. The first packet R1 creates has the MSS set to the maximum value (8960 bytes) and DF=1 in the IP header. If the packet reaches the destination, the session forms. In this case, however, R3 discards the packet and responds to R1 with an ICMP Destination Unreachable (fragmentation needed) message that includes the MTU (1500 bytes) of the R3-R4 link that could not accommodate the packet. R1 then creates another packet with the MSS set to 1460 bytes and DF=1 in the IP header and sends it to R4. This process repeats until a packet reaches the destination.

PMTUD works only if the ICMP message makes it back to the session initiator; if ICMP is filtered somewhere along the path, PMTUD fails silently.
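
On Cisco IOS, PMTUD can be enabled for the router's own TCP sessions globally, or per BGP neighbor under the BGP process. A minimal sketch (the AS number and neighbor address are illustrative; on many recent releases BGP transport PMTUD is already enabled by default):

Router(config)# ip tcp path-mtu-discovery
Router(config)# router bgp 65000
Router(config-router)# neighbor 192.0.2.1 transport path-mtu-discovery

The resulting segment size can be verified in show ip bgp neighbors output, which reports the max data segment for the session.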

Queue Optimization

In Cisco IOS, every second TCP packet is acknowledged with a TCP ACK message. When the input queue on the receiving router overflows, packets are lost, and the missing ACKs trigger retransmissions. The purpose of queue optimization is to minimize this packet loss.

Every BGP packet goes through a packet reception process that has three components:

a) Input Hold Queue

This is a counter assigned to an interface; it is not an actual queue. When a BGP packet received on an interface is destined for the route processor, the input hold-queue is incremented by 1. After the packet is processed, the input hold-queue is decremented by 1 to indicate that the packet is no longer in the queue. Each input queue has a maximum queue depth.

b) Selective Packet Discard (SPD) Headroom

SPD Headroom is a counter that allows the input hold-queue to exceed its maximum size. The total SPD Headroom value is shared by all interfaces.

c) System Buffers

The system buffers store incoming packets being sent to the process level. A packet destined for the processor is removed from the interface buffer and placed in a system buffer.

The packet reception process is as follows:

    1. A BGP packet is received on an interface.
    2. A system buffer is requested.
      1. If no system buffer is available, the packet is dropped.
      2. If a system buffer is available, the input hold-queue is checked. If the queue is full, the packet priority is checked.
        1. If the packet is high priority (IP Precedence 6 or an L2 keepalive), the SPD Headroom is checked. If the SPD Headroom is full, the packet is dropped; if there is room, the packet is kept and the input hold-queue is incremented.
        2. If the packet is normal priority, it is dropped.
      3. If the queue is not full, the packet is kept and the input hold-queue is incremented.
    3. The packet is processed.
    4. The input hold-queue is decremented.

Input Hold-Queue

The Input Hold-Queue is set to 75 by default. This value can be changed using the hold-queue <value> in interface configuration command. The current value can be seen in show interfaces output.

R1#show int gigabitEthernet 0/0
GigabitEthernet0/0 is up, line protocol is up
  Hardware is PQ3_TSEC, address is 6400.f122.6300 (bia 6400.f122.6300)
  Description: Telstra Link Port FNN: N3008515R
  Internet address is 192.168.255.10/30
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full Duplex, 1Gbps, media type is RJ45
  output flow-control is unsupported, input flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input 00:00:00, output 00:00:00, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 1850
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 3283000 bits/sec, 772 packets/sec
  5 minute output rate 6932000 bits/sec, 1307 packets/sec
     11733170299 packets input, 2503963531526 bytes, 0 no buffer
     Received 50983 broadcasts (0 IP multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     28175928629 packets output, 20450282398187 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 unknown protocol drops
     0 babbles, 0 late collision, 0 deferred
     7 lost carrier, 0 no carrier, 0 pause output
     0 output buffer failures, 0 output buffers swapped out

The TCP window size and TCP MSS determine the maximum number of outstanding TCP packets that must be acknowledged. The worst-case hold-queue size can be determined using the formula:

Hold-Queue Size = (TCP Window Size * Number of BGP Peers) / (2 * TCP MSS)
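
As a worked example, assume 50 BGP peers, a 65535-byte window, and an MSS of 1460 bytes (all illustrative values):

Hold-Queue Size = (65535 * 50) / (2 * 1460) ≈ 1122 packets

Rounding up, the interface hold-queue could then be raised accordingly (the interface name is illustrative):

Router(config)# interface GigabitEthernet0/0
Router(config-if)# hold-queue 1200 in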

Selective Packet Discard (SPD)

SPD is a queue-management mechanism that operates on the input hold-queue for traffic destined to the processor. The SPD process divides the queue to the route processor into a general packet queue (GPQ) and a priority queue. Packets in the GPQ are subject to preemptive discard; packets in the priority queue are not. The GPQ is for IP packets only, and it is a global queue, not a per-interface queue.

SPD has two thresholds, a minimum and a maximum, which define three SPD queue states:

    • Normal state - GPQ depth <= minimum threshold
    • Random Drop state - minimum threshold < GPQ depth <= maximum threshold
    • Full Drop state - GPQ depth > maximum threshold

When the queue is in Normal state, packets are not discarded.

In the Random Drop state, SPD begins dropping normal-priority packets randomly. In this state SPD can run in normal or aggressive mode: in normal mode (the default), it drops normal-priority packets at random; in aggressive mode, it also drops malformed normal-priority IP packets.

In the Full Drop state, SPD drops all normal-priority packets until the queue depth falls below the maximum threshold.
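
Aggressive mode can be enabled globally if desired; a minimal sketch (normal mode is the default, so this is only needed when malformed packets should be discarded first):

Router(config)# ip spd mode aggressive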

The SPD process is also used to recognize high-priority traffic. Two extensions to the input queue are available to high-priority traffic: SPD Headroom and SPD Extended Headroom. SPD Headroom allows the input queue to exceed the configured input hold-queue: if the input hold-queue is 75 and the SPD Headroom is 100, the input queue can hold 175 packets. Once the input hold-queue reaches 75, only high-priority packets (IP Precedence 6, IGP packets, and L2 keepalives) are accepted until the queue reaches a depth of 175. SPD Extended Headroom allows the input queue to exceed the configured input hold-queue and SPD Headroom further still; it is reserved for IGP packets and L2 keepalives.
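
On platforms that expose them, the headroom values can be adjusted globally. A minimal sketch (the values are illustrative, and command availability varies by platform and IOS release):

Router(config)# spd headroom 200
Router(config)# spd extended-headroom 20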

The threshold values can be viewed using the show ip spd command. They are calculated from the smallest input hold-queue depth among the router's interfaces: the minimum threshold is 2 less than the size of the input queue, and the maximum threshold is 1 less.

Router# show ip spd
Current mode: normal.
Queue min/max thresholds: 73/74, Headroom: 100, Extended Headroom: 10
IP normal queue: 0, priority queue: 0.
SPD special drop mode: none
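
The thresholds normally track the smallest hold-queue automatically, but they can also be set manually where the commands are available. A minimal sketch (values are illustrative and rarely need to be changed by hand):

Router(config)# ip spd queue min-threshold 73
Router(config)# ip spd queue max-threshold 74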

System Buffers

The system buffers are where packets destined for the processor are actually stored. They are created and destroyed on an as-needed basis. The show buffers CLI command displays the buffers on the device.

Router# show buffers
Buffer elements:
     516 in free list (500 max allowed)
     142439046 hits, 0 misses, 618 created

Public buffer pools:
Small buffers, 104 bytes (total 50, permanent 50, peak 191 @ 7w0d):
     47 in free list (20 min, 150 max allowed)
     54576310 hits, 244 misses, 839 trims, 839 created
     0 failures (0 no memory)
Middle buffers, 600 bytes (total 25, permanent 25, peak 55 @ 7w0d):
     22 in free list (10 min, 150 max allowed)
     11221229 hits, 92 misses, 276 trims, 276 created
     0 failures (0 no memory)
Big buffers, 1536 bytes (total 50, permanent 50):
     50 in free list (5 min, 150 max allowed)
     14240216 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
VeryBig buffers, 4520 bytes (total 10, permanent 10):
     9 in free list (0 min, 100 max allowed)
     1 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
Large buffers, 5024 bytes (total 0, permanent 0):
     0 in free list (0 min, 10 max allowed)
     0 hits, 0 misses, 0 trims, 0 created
     0 failures (0 no memory)
Huge buffers, 18024 bytes (total 5, permanent 0, peak 7 @ 7w0d):
     5 in free list (4 min, 10 max allowed)
     7291 hits, 1 misses, 10434 trims, 10439 created
     0 failures (0 no memory)

<------ Output omitted for brevity ------>

The different buffer types are highlighted in the output above. Of particular interest for BGP tuning are the Small buffers, because TCP ACK messages are 64-byte packets and are stored in this pool. The first number to look at is Permanent, which indicates the number of buffers that are always present in the pool; it should be large enough to store TCP ACKs from all BGP peers. The router will eventually create more buffers if needed, but some TCP ACKs can be lost while that happens. The next item is the min value of the free list; increasing it prompts the router to create more buffers before the pool reaches a critical limit. The last item is the max value of the free list, which should be increased to help prevent buffers from being trimmed prematurely.
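
The Small buffer pool can be tuned with the buffers global configuration commands. A minimal sketch (the values are illustrative and should be sized to the peer count and available memory):

Router(config)# buffers small permanent 150
Router(config)# buffers small min-free 50
Router(config)# buffers small max-free 300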

Please note that increasing these numbers should be done with care, and only if there is enough free memory available in main processor memory. This can be checked using the show memory summary CLI command.

Router# show memory summary

                Head    Total(b)   Used(b)     Free(b)   Lowest(b)  Largest(b)
Processor   122EF3A4  1823542364  37453812  1786088552  699468844  1061370604
      I/O   3BE00000    69206016  14926104    54279912   54226272    28441628

<----- Output omitted for brevity ------>