Modern interactive applications are increasingly deployed at the edge, where servers are placed close to the end users they serve. The reduced distance between the servers and the clients increases the frequency at which they exchange messages. This higher communication frequency makes it likely that packets from different applications collide at the cellular bottleneck at the same time. When this happens, the bottleneck momentarily receives more packets than it can process at line rate, which leads to queueing and, consequently, latency inflation for some of those interactive flows. The problem is more severe for applications with a request-response (i.e., bursty) traffic pattern.
Single-Flow Bursty Traffic
These packet collisions closely resemble the incast traffic that has been observed and studied in data centers. Many congestion control solutions try to mitigate the incast problem in data centers, but they are designed for and evaluated in a highly controlled environment tailored to data center use cases. Such fine-grained control cannot be assumed in a WAN or edge-like environment, which makes this collision problem harder to tackle.
We further studied the effect of synchronized message arrivals on message latency. We observed that synchronized flows experienced an increase in tail latency even when network utilization was low, i.e., when the total volume of messages being processed by the bottleneck was well within its capacity. We tested a setup with a 10 ms RTT and a 200 Mbps bottleneck, in which 16 flows performed synchronized request-response messaging with the clients and servers placed on opposite sides of the bottleneck. Tail latencies increased by 1.2x-3x even when network utilization was around 50%.
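A rough back-of-envelope model helps explain why the tails inflate at such low average load: when all 16 responses hit the bottleneck at the same instant, the last message has to wait for the other 15 to drain, even though the link is idle half the time on average. The sketch below assumes one fixed-size message per flow per RTT, sized so that the aggregate load is ~50% of the bottleneck; the message size is an illustrative assumption, not a measured value.

```python
# Back-of-envelope queueing estimate for synchronized request-response flows.
# Assumption: each flow sends one fixed-size message per RTT, sized so the
# aggregate load is ~50% of the bottleneck (illustrative, not measured).

BOTTLENECK_BPS = 200e6   # 200 Mbps bottleneck
RTT_S = 10e-3            # 10 ms base RTT
NUM_FLOWS = 16
UTILIZATION = 0.5

# Bytes the bottleneck carries per RTT at 50% load, split across the flows.
bytes_per_rtt = UTILIZATION * BOTTLENECK_BPS * RTT_S / 8
msg_bytes = bytes_per_rtt / NUM_FLOWS

# If all 16 messages arrive at once, the last one queues behind the other 15
# before it is serialized onto the link.
queueing_delay = (NUM_FLOWS - 1) * msg_bytes * 8 / BOTTLENECK_BPS
inflation = (RTT_S + queueing_delay) / RTT_S

print(f"message size      : {msg_bytes / 1e3:.1f} KB")
print(f"queueing delay    : {queueing_delay * 1e3:.2f} ms")
print(f"tail RTT inflation: {inflation:.2f}x at {UTILIZATION:.0%} average utilization")
```

Under these assumptions the last message in a collided burst sees roughly 4.7 ms of extra queueing on top of the 10 ms RTT, a ~1.5x inflation that sits squarely inside the observed 1.2x-3x range despite the link being only half loaded on average.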
This problem is distinctive: it resembles the incast-like traffic pattern seen in data centers, yet the latency inflation appears even when the average utilization of the network is low.
Memory-induced Host Congestion
As a part of CS 8803 Datacenter Networks and Systems, I worked with Prof. Ahmed Saeed on re-evaluating the Host Congestion Control (SIGCOMM ’23) paper on newer Intel and AMD systems. The paper highlighted a new problem in the network communication datapath: host congestion, which occurs when packets back up inside the end host itself while being transferred from the NIC to main memory. A memory-intensive application running on the CPU can contend with the NIC for memory bandwidth, and under severe contention the NIC drops packets because it cannot DMA them into memory at line rate. This results in degraded throughput and increased loss rates.
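To make the "memory-intensive application" side of this contention concrete, below is a minimal sketch of a STREAM-style memory-bandwidth stressor. This is an illustrative stand-in, not the workload or tooling used in the paper; the array size and worker count are assumptions. Running several such workers on the cores of the NIC-local socket while driving a line-rate transfer is one way to probe whether DMA traffic and CPU memory traffic contend.

```python
# Minimal memory-bandwidth stressor sketch (illustrative stand-in for the
# memory-intensive co-running application; sizes and worker count are
# assumptions). Each worker repeatedly streams a large array copy, which
# reads and writes main memory well beyond the cache capacity.

import multiprocessing as mp
import numpy as np

ARRAY_BYTES = 1 << 30   # 1 GiB per worker, large enough to defeat the caches
NUM_WORKERS = 8         # scale up until the socket's memory bandwidth saturates

def stream_copy(_):
    src = np.ones(ARRAY_BYTES // 8, dtype=np.float64)
    dst = np.empty_like(src)
    while True:
        np.copyto(dst, src)   # streaming copy: continuous memory reads + writes

if __name__ == "__main__":
    with mp.Pool(NUM_WORKERS) as pool:
        pool.map(stream_copy, range(NUM_WORKERS))
```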
The authors of the paper evaluated their solution on Intel Cascade Lake CPUs. My project involved determining whether the same problem exists on newer Intel and AMD CPUs. Most of the work went into building the right testing infrastructure for Intel and AMD separately. Once that infrastructure was in place, we confirmed that the newer Intel and AMD CPUs (on Cloudlab) did not exhibit the host congestion issue.
On closer inspection, we realized that the authors had tested the problem on a setup where only one or two memory channels were populated. This left the host with very low memory bandwidth, making it easy to saturate with just a few cores plus the network transfer. On most of the setups we have on Cloudlab, four to eight memory channels are populated per NUMA socket; even using all the cores on the socket did not saturate the memory bandwidth enough to degrade network performance. Hence, we concluded that the problem discussed in the paper only occurs when the host's memory bandwidth is limited.
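The arithmetic behind this conclusion is straightforward. Using illustrative DDR4-2933 numbers (roughly the memory generation of Cascade Lake-era hosts) and an assumed 100 Gbps NIC, the sketch below compares the theoretical per-socket memory bandwidth at different channel counts against the NIC's DMA load:

```python
# Rough arithmetic with illustrative numbers (DDR4-2933 channels, 100 Gbps NIC)
# for why the populated channel count matters so much.

CHANNEL_GBPS = 2933e6 * 8 / 1e9   # ~23.5 GB/s theoretical peak per DDR4-2933 channel
NIC_GBPS = 100 / 8                # 100 Gbps NIC ~ 12.5 GB/s of DMA traffic

for channels in (1, 2, 4, 8):
    total = channels * CHANNEL_GBPS
    print(f"{channels} channel(s): ~{total:.0f} GB/s peak, "
          f"NIC DMA alone is {NIC_GBPS / total:.0%} of it")
```

With one or two channels the NIC alone consumes a large fraction of the peak bandwidth, so a handful of memory-hungry cores is enough to push the host into contention; with four to eight channels there is ample headroom even when every core is busy.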
This study aimed to identify the bottlenecks in network stacks when they scale up to a million connections. We started by testing the sender-side stack's performance, measuring the maximum throughput a single sender core can achieve when handling a large number of flows. We observed throughput degrade by ~10 Gbps when the number of flows increased from 10k to 900k, even on the latest Linux kernel stack.
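A simplified version of the sender-side measurement loop is sketched below: open many TCP connections to a traffic sink and round-robin small writes across them from a single pinned core, reporting the aggregate rate. The sink address, flow count, and message size are placeholders, and a real run at hundreds of thousands of flows additionally needs multiple source IPs/ports and raised file-descriptor limits; this only illustrates the shape of the measurement.

```python
# Simplified sender-side flow-scaling measurement (sink address, flow count,
# and message size are placeholders). Pin the process with `taskset -c 0`
# so all flows are driven by a single sender core.

import socket
import time

SINK = ("10.0.0.2", 5001)   # hypothetical sink that discards received data
NUM_FLOWS = 10_000          # sweep this (10k ... 900k) to observe the degradation
MSG = b"x" * 4096
DURATION_S = 10

socks = [socket.create_connection(SINK) for _ in range(NUM_FLOWS)]

sent = 0
start = time.monotonic()
while time.monotonic() - start < DURATION_S:
    for s in socks:              # round-robin a small write to every flow
        sent += s.send(MSG)

gbps = sent * 8 / (time.monotonic() - start) / 1e9
print(f"{NUM_FLOWS} flows: {gbps:.1f} Gbps from a single sender core")
```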