TCP and more TCPs

TCP (transport control protocol) is much more than just a single protocol, it is a common transport protocol header format, all packets use this header format and have a uniform interpretation of header fields. But how the flow and congestion control are implemented is left to the system. This leads to many variants of TCP that attempt to optimize the utilization of channels and maintain fairness in network environments.

Consider a simple scenario that ties up to what TCP congestion control has been trying to achieve. Suppose you are driving along the highway for the first time. After traveling ‘X’ distance you encounter Cops, and you are handed a fine for speeding since you were going over the limit. Using this information, the next time you are driving; you don’t want to get fined yet drive fast (Not recommended, but still, you do!). In this scenario, you speed up to ‘X’, but slow down as you approach ‘X’. Now Cops are not at ‘X’ but some other place ‘Y’, now what we want to do is be cautious in our approach as we don’t know where cops might be and travel faster, so we slowly increase speed again until we are fined again.

This is the same approach TCP variants have been trying to use to optimize link utilization and not lose packets. If this scenario is intuitive, we may now consider different variants of TCP.

Another important thing before we start, TCP congestion control maintains variables/attributes in TCB (TCP Control Block), namely, ‘cwnd’ (congestion window), ‘ssthresh’ (slow start threshold), and others that contain information about the connection state, associated local processes, and feedback parameters.

AIMD (Additive Increase – increase congestion window by 1 MSS for each ACK and Multiplicative Decrease – decreases the ‘ssthresh’ by a factor of 2)

TCP Reno – Ack pacing flow control protocol. TCP makes use of “Ack clocking” (Acknowledgements are counted and there is no explicit timer/clock). In its steady state (Congestion Avoidance), Reno probes the network and steadily increases the congestion window by 1MSS every RTT.

Once the buffers are overflowed at the receiver’s end, it drops the packet and sends the ACK with the expected sequence number, and since this sequence number has already been sent, senders interpret this as packet loss and signal the sender of the receiver’s data reception rate. There can be two types of loss events; 1) 3 DUP Ack count, 2) timeout of a timer associated with the segment. # Dup ACK signals mild congestion while timeout event is interpreted as Severe congestion. TCP Reno cuts down its sending rate by half and this behavior keeps repeating. This is called the sawtooth pattern.

Additive increase if 1MSSS does not work with high-speed networks. It would take tremendous time (~ in hrs) to reach higher flow rates too assuming there is no packet drop, which is a strict constraint. Other challenges associated with Bandwidth products have been noted here

BIC (Binary Increase Congestion Control): when BIC performs a window reduction in response to a packet drop, it remembers the previous maximum window size as well as the current window setting. BIC increases the rate non-linearly, it inflates the rate by half of the difference between the current window size and the previous window size. It decreases the rate as it approaches the previous threshold value and halves the window inflation rate. The challenge with BIC is that it can be aggressive in low-RTT networks and in slower speed situations. CUBIC is an improvement over BIC as it introduces a higher-order polynomial function as a congestion control algorithm. Another point of difference is that CUBIC is a function of a timer associated with window reduction, rather than making use of BIC’s RTT counter. This makes TCP fairer in concurrent TCP sessions, that may have different RTTs.

CUBIC is more efficient in high-speed flows. There have underlying assumptions that we did not specify and may be useful to understand, but with this simplified overview, we can now move to newer variants. But if CUBIC is efficient, why do we need another TCP variant? To answer that we need to understand the basic Queue works.

Consider this, we have a pipe with a fixed diameter, and it can store some volume of water as it passes along to other pipes. The maximum flow rate is achieved when there is no water in the pipe and once this pipe gets full, water will overflow. But a more common situation is that it has some water. These three states in Queue are, No Queue, Queue Saturation and Queue Formation. The optimal point in this flow rate is the onset of the formation of queues, rather than the onset of overflow.

Reno tries to maintain the Queue formation state while CUBIC places the flow at the onset of Queue Saturation. This will lead to increased delays due to queue occupancy.

TCP Vegas detects congestion based on the increasing RTT time and does not wait for packet loss to infer congestion in the network, reducing the sending rate when RTT is high. When there are other flows in the network that do not utilize TCP Vegas, it loses out on the share of utilization to other networks because they are not yet decreasing their sending rate. Also, if a longer RTT is due to route change, the sending rate is dropped erroneously. Although it has major limitations the idea of delay–sensitive cloud congestion algorithm does make a lot of sense.

BBR - TCP delay-controlled TCP flow control algorithm from Google.

Bandwidth and RTT are influenced by several factors in addition to the data being passed through the network for a particular flow. Once BBR has determined its sustainable capacity for the flow, it attempts to actively defend it from being overcrowded by the concurrent operation of conventional AIMD protocols. The bottleneck capacity is the maximum data delivery rate to the ACK stream, over a sliding window of the most recent six to 10 RTT intervals. The intended operational mode is that the sender is passing packets into the network at a rate that is anticipated not to encounter queuing within the entire path. This is a significant contrast to protocols such as Reno, which tends to send packet bursts at the epoch of the RTT and relies on the network’s queues to perform rate adaption in the interior of the network if the burst sending rate is higher than the bottleneck capacity. For every received ACK, the BBR checks if the originally sent data was application limited. If not, the sender incorporates the calculation of the path RTT and the path BW into the current flow estimates. The probing mechanism is BBR is to spend one RTT interval to send at a multiple of BW-Delay product. Successive probe operations will continue to increase the sending rate by the same gain factor until the estimated bottleneck bandwidth no longer changes because of these probes. Because this gain factor is a multiple of the Bandwidth delay product, BBR’s overall adaption to increased bandwidth on the path is exponential rather than the linear adaption used by TCP Vegas. This regular probing of the path to reveal any changes in the path’s characteristics is a technique borrowed from the drop-based flow control algorithms. There is also the possibility that this increased floe pressure will cause other concurrent flows to back off, and in that case BBR will react quickly to occupy this resource by sending a steady rate of packets equal to the new bottleneck bandwidth estimate.

To get over Vegas giving off share to drop based TCP protocols, BBR achieves fair share due to its periodic probing at an elevated sending rate.

Now to understand challenges associated with BBR, we must first understand the QoS traffic Policing. There is an upper bound on flow rate imposed by the ISP based on the SLA for bitrate called CIR (Committed Information Rate). The link is capable of higher rate, but ISP limits the traffic based on what you are paying (called Traffic Contract). The traffic flow rate is limited by policing or shaping. The difference between the two is that policing will drop the exceeding traffic and shaping will buffer it. The policer measures the cumulative byte-rate of arriving packets and could take either of the actions, allow the packet to pass, drop the packet or remark the packet with a different DSCP or IP precedence value.

The bucket constrains the burst in traffic which cannot exceed this bucket value. The basic idea behind token bucket is the number of packets that should be passed to network is equal to the number of tokens in the bucket after replenishing, which is based on difference of packet arrival time and previous packet arrival time and police rate. The packets greater than this number may be discarded and this is where BBR might suffer. BBR may probe the Bandwidth at the time when bucket is full and set its sending rate base don that but then the bucket might get empty, and packets are discarded because they were sent with higher sending rate then accepted. This leads to upstream bandwidth wastage and increases the systems latency due to packet losses. To counter this, Google has added policer detection and explicit policer model to BBR. The feedback that the token-bucket policer impresses onto the flow control algorithm is one where an increase in the amount of data in flight is not necessarily going to generate a response of a greater measured RTT, but instead generate packet drop that is commensurate with the increased volume of data in flight.

Reference