Computer Networks Notes 3: Transport Layer


1. Layered Network Design

As we have seen so far, IP provides end-to-end data transmission. In order to do so, it must rely on some lower layer protocol. For us this has been SLIP, but there are a number of lower layer protocols that can be used, including Ethernet, 802.11 (also known as Wireless Ethernet or WiFi), PPP, ATM, and so on. In fact, one of the desirable properties of IP is that it can run over a number of different lower-layer protocols. It is partly this property that helps make IP a universal technology, and the Internet a universal inter-network.

The International Organization for Standardization (ISO) has standardized the Open Systems Interconnection (OSI) reference model for talking about networks. The model has seven layers, each with different responsibilities; in this model, IP belongs at the network layer, responsible for the end-to-end delivery of data.

Splitting the protocol stack into these layers gives us a modular design for a networking protocol. A modular design is important for a number of reasons. It helps us design protocols that each do one thing well, and combine them with other protocols that do other things well, to build a predictable and well-functioning system. A modular design also helps us recognize when protocols are interchangeable -- for example, the fact that a data-link connection can be made with either WiFi or Ethernet can help us make the best decision in each given situation. Finally, a model such as this gives us an organized way of talking about networking protocols. When we say that a protocol or a system works at layer 4, we know it is concerned with the end-to-end features of data transport, or with the headers involved in supporting such features -- in the Internet, this usually means TCP or UDP, though other layer 4 protocols may be designed in the future.

The drawbacks of this 7-layer design are also numerous. In particular, there are very few commonly used protocols in layers 5 and 6 -- such functions tend to be integrated into the corresponding layer 7 protocol. Also, having a modular way of talking about protocols does not mean the protocols themselves are modular -- as we shall see below, TCP has a number of dependencies on IP, and therefore an implementation of TCP could not be run unmodified on top of a different networking protocol. Finally, it is important to remember that this is a model, and real protocols often do not quite fit it. TCP, for example, establishes connections and could therefore be seen as a session level protocol. On the other hand, a TCP connection can be lost, and will not automatically be re-established by TCP, so TCP is usually considered a layer 4 protocol -- but the distinction is subtle. ICMP is technically layered above IP, but in the OSI model belongs at the network layer with IP. Perhaps more dramatically, ATM (Asynchronous Transfer Mode, often carried over SONET/SDH or other physical-layer technologies) can be used to build networks composed of many links, each running the ATM data-link transfer protocol. When we use ATM to carry IP packets, we think of ATM as a data-link protocol, even though ATM may be transporting packets over a large number of links. On the other hand, if we run voice directly over ATM, we think of ATM as a network layer protocol.

Finally, a major drawback of this model is that it says nothing about security. This is partly because it was developed during a time when network access was very limited, and therefore security was not as much of an issue, and partly because providing security is very challenging and, arguably, belongs on every layer of the stack.

Roadmap. Section 1 briefly looked at the application layer, and Section 2 covered in detail the main network layer protocol, IP. This section looks at common transport layer protocols, with particular focus on TCP. Section 4 presents common data-link layer protocols, where necessary mentioning the physical layer, and Section 5 presents protocols used in public networks, that is, data networks provided by public carriers such as the telephone companies.

The order followed in this presentation is one of many suitable sequences. Some books focus almost entirely on the application layer. Others begin with the physical layer and move up the stack, from layer 1 upward; still others start with the application layer and move down the stack. I find it desirable to motivate students by starting with a brief presentation of the application layer and the physical layer, representing simple but real networking systems (Section 1). This is followed by a focus on the glue that ties it all together -- the network layer (Section 2). It is only after studying what the network layer does and does not provide that it really makes sense to study the transport layer. Likewise, it is only after studying the needs of the network layer protocols that it really makes sense to study the data-link layer protocols (Section 4).

Since this is an advanced course, little or no time is spent on layer 1 and layers 5 through 7, assuming students have had an exposure to such information in a previous course, or, if not, can fill in the details themselves by studying the appropriate RFCs and other documentation. Instead, we focus on the protocols in layers 2 through 4.

2. Error Detection

We have seen already that the IP header carries a checksum whose function is to detect errors in the header, and to cause the packet to be discarded when the check fails.

There are three major drawbacks to the IP header checksum that we will look at in this section. The first is that the checksum does not protect the payload of an IP packet. The second drawback is that the checksum only protects against transmission errors, and does not protect against errors that occur within the router. Finally, although easy to compute in software, the checksum is actually not as effective at error detection as other algorithms, particularly cyclic redundancy checks, or CRCs. The section closes with a look at correcting errors without retransmission by sending additional data, a technique known as forward error correction.

2.1 Transport-Layer Checksum

Since the IP checksum does not protect the payload, transport layer protocols that wish to detect packets with errors must implement their own checksum. We have seen that ICMP has a checksum that covers the ICMP header and the ICMP payload. Likewise, all transport layer protocols in the Internet compute a checksum over their header and payload, and transmit it as part of the header. In general, these follow the steps given in Section 2.5.

There is one exception to this scheme, and that is UDP. UDP is meant to be a lightweight protocol, and computing the checksum requires adding all the bytes in the message, potentially slowing down the networking stack. In addition, some applications, such as the Network File System, are only used over a local network -- if this local network provides a strong error detection algorithm such as a CRC, application performance can substantially improve if checksums are not computed and checked. Arguments such as these were used to justify making checksums optional for UDP -- if no checksum is computed, the checksum field is sent as all zeros.

We saw similar reasons given for why IPv6 does not have a header checksum. IPv6, however, requires that any protocol carried over it, including UDP, compute a checksum over its header and payload, so checksums are not optional for UDP carried over IPv6.

2.2 Pseudo-Headers and Payload Checksum

One problem with the IP header checksum is the number of errors it does not detect. In particular, consider a packet that is received by a router. During the interval of time between when the packet is received (and the checksum verified) and when it is sent again (and the checksum recomputed), the packet is stored in the router's memory. Anything that modifies the packet header during this interval will not be detected by the header checksum.

Such an undetected error is most likely to have no effect whatsoever. For example, a change in the TTL or the type of service or the packet identifier is unlikely to affect packet delivery. However, such an undetected error can also lead to packet misdelivery if it affects the source or destination address, the total length field, or the protocol field. These four fields are essential for correct packet delivery.

What can modify the packet header during the time the packet is stored in the router's memory? In the early days of the Internet it may be that memory errors were more common, but these days such errors are extremely uncommon. Instead, the most likely cause of such errors is software. A pointer error in the router program can accidentally change an address or another one of the crucial fields. Improperly implemented fragmentation can generate faulty headers, and the problem may remain undetected for a long time since fragmentation is only used infrequently.

The solution to these problems is to include these four crucial fields in a checksum carried in the transport-layer header. We have already seen that the transport-layer protocols all checksum their own headers and data. It would be relatively simple to store an additional checksum over these fields in the transport-layer header, but this would require an additional 2 bytes in the header. Instead, the designers of TCP came up with the idea of putting these four fields into a special header, and computing the checksum over the transport-layer header, the payload, and this special header. Both the sender and the receiver can build the special header from the IP header, and so this special header need never actually be sent -- it is for this reason that it is called a pseudo-header. The following figure compares what is actually sent with what is used to compute the transport-layer checksum.

RFC 793, the definition of TCP, gives the following definition of the pseudo header.

+--------+--------+--------+--------+
|           Source Address          |
+--------+--------+--------+--------+
|         Destination Address       |
+--------+--------+--------+--------+
|  zero  |  PTCL  |    TCP Length   |
+--------+--------+--------+--------+

RFC 2460 gives the following definition for the pseudo-header to be used by a protocol layered over IPv6.

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                         Source Address                        +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                      Destination Address                      +
|                                                               |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Upper-Layer Packet Length                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      zero                     |  Next Header  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Any significant change in the IP header while the packet is en route, and specifically any change in source or destination address, packet length, or protocol/next header field, will result in the transport layer checksum being computed over a pseudo-header at the receiver that is different from the pseudo-header used by the sender, and therefore in a checksum mismatch.

As shown in the figure, the pseudo-header is conceptually prefixed to the regular header when the checksum is computed, though since the checksum is commutative, actual implementations may use different orders when computing the checksum. A highly optimized implementation could actually pre-compute much of the header checksum for a given socket, only adding in on a per-packet basis those fields of the IP and TCP/UDP headers that actually vary from packet to packet.

2.3 Cyclic Redundancy Checks

The commutative nature of the checksum makes it easier to compute in software and to optimize the computation, but has negative consequences for detecting some classes of errors, notably errors in which two 16-bit quantities are swapped. This is because adding 0x1234 to 0x5678 must necessarily give the same result as adding 0x5678 to 0x1234. Again, software errors are the most likely to cause this kind of problem.

Other problems are more likely to have physical causes. If an error is equally likely to strike any bit of a message, we think of it as a bit error. Checksums are very good at detecting bit errors, but a 16-bit checksum has a 1/32, or about 3%, probability of failing to detect that two bit errors have affected a message. That is because the second bit error has a 1/16 probability of affecting the same bit position within its 16-bit word as the first bit error, and then a 1/2 probability of adding back to the checksum the value that the first bit error subtracted, or vice versa.

A final class of errors, usually detected properly by the checksum, is the burst error. In a burst error, the probability of a bit being changed is increased if the bit is near other bits that have errors. One cause of burst errors is mechanical, as when a cable is being connected or disconnected. Common burst errors are all-ones and all-zeros sequences caused by hardware malfunctions.

Even though checksums are easy to compute in software and effective at detecting burst errors, their weakness at detecting multiple bit errors has made them less attractive than a different class of error checking algorithms which have found widespread use: Cyclic Redundancy Checks. These are easy to compute in hardware and can be computed efficiently in software -- though not as easily as checksums.

The fundamental principle of a CRC is that of dividing a large binary number, the message, by a fixed binary number, the CRC generating polynomial. The remainder of the division is then sent with the message as the CRC check string. The receiver performs the same operation, and if the result matches, the receiver assumes that the original dividend, the message, is unchanged.

In order to implement the division efficiently in hardware, a special kind of arithmetic is used: arithmetic on polynomials with either 0 or 1 as their coefficients. This reminds us of the IP checksum using one's complement arithmetic rather than the more traditional two's complement arithmetic. A polynomial with 0 or 1 as coefficients is simply a sum of powers of x, for example, 1 * x^3 + 0 * x^2 + 1 * x^1 + 1 * x^0, which is simply x^3 + x + 1.

There are many advantages to using such polynomials. The first is that they are easily represented as bit strings: the above polynomial can be written as 1011. The second advantage is that addition and subtraction are the same: if polynomial p1 = x^3 + x + 1 and polynomial p2 = x^3 + x^2 + 1 are added or subtracted, in either case the result is p1 + p2 = p1 - p2 = x^2 + x.

To divide the message by the CRC polynomial, we shift the message past the polynomial and, whenever the most significant bit of the result is a one, subtract the CRC polynomial, that is, invert the bits corresponding to ones in the polynomial. The following figure shows the computation for one standard CRC called CRC-8, for which the polynomial is x^8 + x^2 + x + 1. There is an XOR gate for every coefficient of the polynomial that is 1, and no XOR gate for the coefficients that are 0, since no subtraction is needed there. The most significant bit of the polynomial is always one, and does not correspond to an XOR gate, since it is always subtracted if the corresponding bit in the shift register is one, and never subtracted if it is zero.

After the entire message has been shifted in, the CRC is found in the 8 bits of the shift register.

As can be imagined, the selection of a CRC polynomial is done carefully to detect as many errors as possible. The coefficient of x^0 is always one -- this is sufficient to detect all single bit errors. However, CRCs in practical use also detect double-bit errors, and many can also be used to detect multiple bit errors and to correct errors.

2.4 Forward Error Correction

Assume an error check scheme, perhaps a CRC or perhaps another scheme, that detects all single-bit errors and all double-bit errors. If the scheme tells me I have received a message with a single bit error, it might be possible for me to reconstruct the original, correct message, simply by changing each bit in the message, recomputing the check, and seeing if the result shows no errors. If so, I can be reasonably certain that I have fixed the error. This depends on the ability to detect multiple errors -- otherwise, I might be unable to distinguish between having fixed the error, and having introduced a new error that happens to cancel out the original one. In addition, it also depends on multiple bit errors being much less likely than single-bit errors. For example, if a three-bit error is as likely as a single-bit error, my error check scheme may not be able to distinguish the two, and may erroneously declare and correct a single-bit error where in fact 3 bits were changed.

Being able to reconstruct a corrupted message is called forward error correction -- forward because enough additional information is sent by the original sender, whether errors occur or not. In contrast, Section 3 (below) considers error recovery by retransmission of the data, which is not a form of forward error correction. Forward error correction always requires redundancy, or sending of more data than the receiver would need in the guaranteed absence of errors.

A simple form of forward error correction is to send each packet three times, each time with the same packet identifier -- this is typically different from the IP packet identifier. If a receiver receives multiple copies of a packet with the same packet ID, it can discard the duplicates. If one of the packets is lost, the receiver can still reconstruct the message. If one of the packets is corrupted, the receiver can compare the copies to reconstruct the original message: if two copies are identical and the third is different, the receiver can assume that the identical copies are correct and the one that differs has been corrupted.

The same strategy of triple redundancy can be used on a byte level rather than a packet level, making it possible to correct for more errors -- for example, the first packet could have an error in the first byte, and the third packet an error in the 15th byte, and both these errors can be detected and corrected.

More subtle strategies for forward error correction are the above-mentioned use of CRCs, and sending a parity packet for every N packets, each packet with its own checksum or CRC. In this case, we can tell which packet is corrupted, and we can use the parity to reconstruct the packet from the values of the other N-1 packets. This is similar to the RAID technology used to deal with disk errors.

Forward error correction is very useful when transmissions have a long latency, as they do when earth stations are transmitting to space probes -- even at the speed of light, a signal can take many minutes, and for the most distant probes close to a day, to reach the probe. In such cases, burst errors are sometimes caused by astronomical phenomena, and so it is advantageous to transmit the multiple copies at considerable intervals. Another case where forward error correction can be advantageous is in satellite transmissions, which have sufficiently large latencies that many hundreds of milliseconds can go by before the recipient finds out that data has been lost. This can be generalized to the situation where data must be transmitted in a time that does not permit retransmission, that is, to most real-time transmissions.

Wherever real-time is not an issue, forward error correction is wasteful, because it requires sending data that will only be used if there are errors. In general, it is more efficient to retransmit data instead.

3. Retransmission

If I wish to send data reliably, and the receiver is cooperating, I can simply attach a number to every item of data that I send. The receiver can then send me a response saying that the data has been, or has not been, received. This is the basis of reliable transmission by retransmission. In designing such a retransmission scheme, we have a number of choices.

We can use positive acks, negative acks, or both. Positive acks are more reliable when the receiver has no advance knowledge of what is being sent, but negative acks require less overhead (when the packet loss rate is low) and might lead to faster retransmission whenever they are suitable. The lower overhead is because most packets are not lost, and therefore we would expect to send fewer NAKs than ACKs in a comparable system. The faster retransmission is because if a packet or ack is lost the sender may not know exactly how long to wait before retransmitting, whereas a NAK can promptly provide that information. NAKs are less reliable, however -- consider what happens if a NAK itself is lost. As a result, systems either use ACKs only, or a combination of ACKs and NAKs. TCP basically uses ACKs, with 3 duplicate acks giving the performance benefit of NAKs in an optimization called fast retransmit, described in RFC 1122 and with additional details in RFC 2001.

The next choice is cumulative acks/naks, or individual acks/naks. A cumulative ack is advantageous for a number of reasons: if many packets are received in close succession, it is more efficient to send a single cumulative ack for all of them than sending individual acks. Also, if an ack in a series of acks is lost, the next ack will correctly and automatically update the information at the sender. On the other hand, individual acks can also provide benefits. For example, a NAK becomes unnecessary since the sender can see exactly which packets the receiver has received, and which need retransmitting.

Finally, there is the choice of numbering packets or bytes. Either way, a single sequence number is actually transmitted -- TCP only explicitly transmits the sequence number of the first byte in a segment, and the sequence numbers of the other bytes can be computed from their position within the segment payload. One consideration for distinguishing these two numbering schemes is how many bits the sequence number needs. Most retransmission schemes are designed to work no matter how much data is sent, but for any sequence number with a finite number of bits, eventually the sequence numbers will wrap around. This can be seen most dramatically in the alternating bit protocol, a simple retransmission protocol in which successive segments carry sequence numbers 0, 1, 0, 1, 0, ... To use this protocol, at most one packet may be unacknowledged at any given time. To understand this protocol, consider the following possibilities:

  1. Transmission is successful. The receiver receives the next sequence number, and acks it. The sender receives the expected ack, and sends the next packet.
  2. The packet is lost. The sender retransmits with the same sequence number, and the receiver gets it and acks as in the previous case (the receiver probably doesn't know the packet was lost).
  3. The ack is lost. The sender retransmits with the same sequence number. The receiver can tell from the sequence number that this is a duplicate packet, and discards it after sending an ack.

These scenarios are also illustrated in the following figures, which show timelines: graphs in which time increases downward and packets are represented by arrows.

These are the only possibilities, and in each case the protocol recovers: every packet is eventually delivered exactly once, and the sender and receiver never disagree about which sequence number comes next.

The situation becomes more complicated if there can be several unacknowledged packets at any given time. In general, the sequence number space must be large enough for twice the maximum number of unacknowledged packets. We can see this by imagining a system where the sender can send up to N packets before receiving the first acknowledgement -- if you work well with concrete examples, imagine that N = 128, with packet numbers 0 through 127. The acknowledgment for the last packet, however, is lost, so the sender retransmits the last packet (number 127), as well as the next N-1 packets (numbers 128 through 254). If the sequence number space holds fewer than 2N-1 = 255 values, then some of the new packets will have the same sequence numbers as the old packets that have just been acknowledged.

The problem with the above scenario is that the receiver has no way of telling which of the newly received packets are retransmissions and which are actually new packets. The solution is to make sure the sequence number space is 2N or larger, which allows the distinction to be made.

Having said all this, we can now return to the original question of whether to number packets or bytes. If the sequence number has b bits, then at most 2^(b-1) packets or bytes can be unacknowledged at any given time. If this is a limitation for our protocol, that is, if b is relatively small, then it is better to count packets. Alternately, if we want to count bytes, we must make sure that b is large enough for the maximum number of unacknowledged bytes. TCP does this, setting b to 32, so that the sequence number is a 32-bit integer. Because the window field is 16 bits, the maximum number of unacknowledged bytes in TCP is 2^16 - 1 = 65535, so in TCP it is generally possible to say which packets are new and which are retransmissions. There are exceptions -- an old packet might be delivered by a router after a long time, and get TCP to accept incorrect data, but the chances of this are very small. A more interesting exception is a TCP extension, window scaling, which allows up to about 2^30 bytes (1 GB) of data to be unacknowledged at any given time.

Further protection against TCP sequence number wrap-around is provided by the optional TCP timestamps, introduced in Section 4 of RFC 1323.

The main benefit of having TCP sequence numbers keep track of bytes rather than packets is that it is possible to re-packetize the data on retransmission. For example, assume a program sends first one byte, then another, then a third. This could happen if the program is a terminal program transmitting a user's key presses. If all these packets are lost, then in TCP all three bytes can be retransmitted using a single packet.

This does not seem like a great benefit -- after all, retransmissions are few, and retransmissions where we can coalesce several packets are probably even fewer. However, it fits well with the TCP philosophy of providing a stream of bytes rather than a stream of packets, so perhaps the choice was partly philosophical.

In TCP, both sides can send at the same time. Also, the acknowledgement is simply one field in the TCP header -- every TCP segment (except for the initial SYN segment which sets up the connection, described below) must carry a valid acknowledgment. A TCP ACK is simply a TCP packet with zero bytes of payload, but it is important to remember that a TCP data packet also always carries a valid acknowledgement.

3.1 Maintaining the TCP adaptive timer

Finally, we consider how long to wait before retransmitting. Logically, TCP must start a timer whenever it sends a packet. Since round-trip delays in the Internet vary considerably, from a fraction of a millisecond on a LAN to many hundreds of milliseconds on long-distance satellite connections, the designers of TCP adopted the principle that the timer should be set based on the RTT measured from sending a packet (ignoring retransmitted packets in this calculation). When an ACK is received, a packet can be removed from the retransmit queue, and then the time it was originally sent (if it was never retransmitted) can be compared to the current time. The difference is the current measured RTT, RTTi.

This RTT is used in two ways: to keep a running average of the estimated RTT, and a running average of the difference between measured RTT and average RTT. A running average RA_i of a quantity is simply an average computed from the previous running average RA_(i-1) and the newest sample S_i by the formula RA_i = alpha * S_i + (1 - alpha) * RA_(i-1), for a given constant 0 < alpha < 1. A value of 1/8 for alpha means the running average is about 7 parts the old average and 1 part the new sample; in other words, if the samples suddenly change from value X to value Y, the running average moves about two-thirds of the way toward Y over the next 8 samples. The value 1/8 is chosen because, as with any power of two, it makes it easy to compute such a running average using integer arithmetic:

  extern int get_sample (void);  /* produces the next measured sample */

  int ra = 0;  /* or whatever the initial value is */
  int next_sample;
#define ALPHA_BITS 3  /* corresponding to alpha = 1/8 */

  while (1) {
    next_sample = get_sample ();
         /* (1-alpha) * RA */          /* alpha * S */
    ra = ra - (ra >> ALPHA_BITS) + (next_sample >> ALPHA_BITS);
  }

The two TCP running averages are the RTT, with an alpha of 1/8 (RA_RTT), and the deviation between measured RTT and running average RTT, with an alpha of 1/4 (RA_Dev, whose samples are abs(RTT_i - RA_RTT)). The timeout TO is then set to TO = RA_RTT + 4 * RA_Dev. In other words, if the deviation is small, the timeout is set to approximately the average RTT, so a lost packet or lost ack is detected quickly and retransmission can also occur quickly. On the other hand, if the deviation is large, the connection is suffering considerable unpredictability, and TCP is careful not to retransmit until after allowing ample time for the ack to arrive.

4. Flow Control

One possible reason for loss of packets might be lack of buffer space on the receiver, in which to store incoming packets. This is likely under a number of circumstances, all of them involving the receiver being unable to process data at the same speed as the sender is generating data. To avoid this, many networking systems, including TCP, let the receiver control the amount of data the sender may send at any given time. This is called flow control. TCP does flow control specifically by limiting the number of bytes a sender can transmit before receiving an acknowledgement.

We mentioned above that the maximum number of unacknowledged bytes in TCP is 2^16 - 1 = 65535. The TCP field that the receiver uses to communicate this value to the sender is called the window, and is carried within the acknowledgement packet. The TCP window is a 16-bit field, and the interpretation of ack number A and window size W is that the sender is allowed to send bytes up to, but not including, sequence number A+W. For example, with an ack number of 100000 and a window size of 300, I am allowed to send bytes up to and including sequence number 100299, but not sequence number 100300.

The mechanism of having the receiver tell the sender how many bytes it may send is called a sliding window mechanism. The idea is that the window is those sequence numbers that the sender may transmit at any given time. As the sender sends packets, the left edge of the window slides to the right, and the window becomes smaller. When the receiver acknowledges, it may keep the right edge of the window where it was, or it might slide it to the right, as it must do eventually to permit the sender to send more.

The receiver has to set the window size based on the amount of buffer space it has. If the receiver has lots of buffer space, it can send a large window. If the buffer space fills up because the application is not consuming incoming data, the receiver should let the window shrink, to size zero if necessary. Once the application reads the data, the receiver can send a duplicate ack to enlarge the window.

One problem with this scheme is that the duplicate ack sent with the larger window may be lost, and the sender and receiver will then each be waiting for the other to allow communication. To avoid this problem, TCP requires that a sender that has data to send but a zero window should keep transmitting 1-byte packets at a low rate. If the receiver has no buffer space for these packets, it should discard them and acknowledge the last valid sequence number, providing a confirmation that the window is still zero. If the ack enlarging the window was lost, however, the receiver now will accept and ack the new data, and the sender will be able to send additional data.
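The receiver's side of this probing can be sketched as follows (Python; the function and field names are illustrative, not taken from any real stack):

```python
def handle_probe(probe_seq, rcv_nxt, buffer_free):
    """React to a 1-byte zero-window probe: accept the byte if buffer
    space has opened up, otherwise re-ack the last valid sequence
    number with a window of zero, confirming the window is closed."""
    if buffer_free > 0 and probe_seq == rcv_nxt:
        # The application has read some data: accept the probe byte
        # and advertise the space that remains.
        return {"ack": rcv_nxt + 1, "window": buffer_free - 1}
    return {"ack": rcv_nxt, "window": 0}

print(handle_probe(100, 100, 0))     # window still zero: re-ack byte 100
print(handle_probe(100, 100, 4096))  # window reopened: probe byte accepted
```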

Another problem with this scheme is that the amount of data an application reads at any one time may be very small. If that happens, the sender will only have a small window, and only be able to send small segments, whereas sending larger segments generally has lower overhead (because fewer headers are sent for the same number of data bytes). This is called the silly window syndrome, and to avoid it, TCP suggests shifting the right edge of the window in units of the maximum segment size, or MSS, both on the sender and on the receiver. Unfortunately, this is not always possible. TCP uses a number of strategies to deal with this, including waiting for all sent data to be acknowledged before sending a small segment, and setting a timer so that a small window is eventually used if that is all that is available. In order to avoid sending small segments whenever possible, TCP also waits to send small amounts of data (less than 1/2 the MSS) if there is unacknowledged data outstanding. This is called the Nagle Algorithm, and leads to data from successive send calls being combined in a single segment and being received as a unit.
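The sending rule just described can be sketched in a few lines (Python; a simplification that uses the text's threshold of half the MSS, and ignores the PSH flag and override timers of real implementations):

```python
def nagle_may_send(pending_bytes, mss, unacked_bytes):
    """Decide whether buffered data may be transmitted now, following
    the rule in the text: delay small segments while earlier data is
    still unacknowledged."""
    if unacked_bytes == 0:
        return True              # nothing outstanding: send immediately
    if pending_bytes >= mss // 2:
        return True              # not a "small" segment by the text's rule
    return False                 # small segment + unacked data: coalesce

print(nagle_may_send(100, 1460, 0))    # True: idle connection
print(nagle_may_send(100, 1460, 500))  # False: wait for the ack
print(nagle_may_send(900, 1460, 500))  # True: at least half an MSS pending
```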

Finally, a receiver could shrink the window by shifting the right edge of the window to the left. For example, when acking byte 100000 I might advertise a window of 10000 bytes. Then, after receiving 5 bytes, I could send ack 100005, and a window of only 300 bytes. The problem with this is that bytes past 100305 might already be in transit, and I would be left with the choice of either storing them and acknowledging them, or discarding them. The latter is undesirable, and therefore TCP does not allow receivers to shrink the window.

4.1 Bandwidth-Delay Product

A sliding window design is a way to slow down a sender when appropriate, that is, when additional data sent by a sender would have to be discarded by the receiver due to lack of buffer space.

However, a sliding window mechanism can also slow down a sender inappropriately. In particular, consider two hosts connected by a single link with bandwidth B bits per second and round-trip time RTT. In the absence of a sliding window, the sender could send up to B bits per second. With a sliding window of size W, however, a sender can only send approximately one window every RTT, so the maximum rate permitted by the window is B' = W / RTT. This is true regardless of the speed of the receiver.

If B' < B and if the receiver is fast enough to receive at least B bits per second, then the speed at which the sender may send is limited by the window W rather than the network bandwidth B.

Requiring B' ≥ B and multiplying both sides of the equation by the RTT, we find that the minimum window needed to send at full speed is W ≥ B * RTT. The product B * RTT is known as the delay-bandwidth product or the bandwidth-delay product.

For example, a 1Gb/s satellite link with a round trip delay of 300ms would have a bandwidth-delay product of 1Gb/s * 0.3s ~= 300Mbit or 37.5MByte.

Note that link speeds are usually given in bits/second, whereas everything else, including window sizes, are usually specified in bytes (or sometimes in packets), so remember to convert as appropriate.

On this link, a TCP transmission with a window size of 32KB would be able to send about 3.3 windows/second, or about 100KB/s, i.e. 800Kb/s. This is only 0.08% of the available link speed. Clearly, for good link utilization, a bigger window (at least 37.5MB for full speed) is needed. This may mean that the receiver will need about that much buffer space to avoid discarding data -- on a modern machine this is not a problem, though in practice it may require negotiations between the application and the operating system.
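These calculations are easy to reproduce (Python; note the explicit bits-to-bytes conversions mentioned above):

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product, converted from bits to bytes."""
    return bandwidth_bps * rtt_s / 8

def window_limited_rate(window_bytes, rtt_s):
    """Maximum sending rate B' = W / RTT, in bits per second."""
    return window_bytes * 8 / rtt_s

# The satellite link from the text: 1 Gb/s bandwidth, 300 ms RTT.
print(bdp_bytes(1e9, 0.3))                  # about 37.5 million bytes
print(window_limited_rate(32 * 1024, 0.3))  # under 900 Kb/s, a tiny
                                            # fraction of the 1 Gb/s link
```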

In general, the bandwidth-delay product is only significant on networks where both the bandwidth and the delay are large. A modem link, for example, is high delay but low bandwidth, and a relatively small window is then acceptable. A LAN may be high bandwidth but low delay, so again a small window is unlikely to unduly constrain traffic. In contrast, the satellite link described above, or other high-bandwidth WAN connections, including fiber optic and microwave links, will require large TCP windows to achieve acceptable performance with TCP.

A final issue in supporting large windows is the TCP window size field, which is only 16 bits wide, allowing window sizes of at most 65535 bytes (just under 64KiB). A whole RFC, RFC 1323, is devoted to TCP Extensions for High Performance, and covers not only the issue of limited window sizes but also measuring the round-trip time and protecting against wrapped sequence numbers. The core of the RFC is only about 25 pages long, and is recommended reading.

RFC 1323, mentioned above, introduces a new TCP option, the window scale option. This option is only sent with a SYN packet. If both sides send this option (both sides sending the same option is sometimes referred to as a "negotiation", a term that is popular but not literally accurate), then scaling is in effect. The basic idea is that the 16-bit TCP window will still be used to indicate the window size, but the receiver of a TCP header will compute the actual window size by multiplying the 16-bit window value rcv.wnd, received in the TCP header, by a number n = 2^x, so w = rcv.wnd * 2^x. Since this multiplication is by a power of two, it is usually implemented by shifting left by x bits. The sender has to do the opposite scaling (shifting) operation before placing the value in the snd.wnd field of outgoing packets. The value x is transmitted as an 8-bit value in the initial TCP window scale option, and is valid for the entire connection.
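The scaling itself is just a pair of shifts (a Python sketch; RFC 1323 limits the shift count x to at most 14):

```python
def actual_window(rcv_wnd_field, x):
    """Window as interpreted by the receiver of a header: the 16-bit
    field shifted left by the shift count from the window scale option."""
    return rcv_wnd_field << x

def header_field(window_bytes, x):
    """What the sender of a header places in the 16-bit field: the
    real window shifted right (truncating, so the advertised window
    is never overstated)."""
    return window_bytes >> x

print(actual_window(65535, 7))   # 8388480 bytes, about 8 MB
print(header_field(8388480, 7))  # 65535
```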

5. Congestion Control

A receiver with more data than it can handle must discard the data. The same is true for a router receiving data faster than it can forward it. Since router queues are finite, eventually a router experiencing this must drop some of the incoming data. This situation is called congestion.

Retransmission as described so far does nothing to help relieve congestion. In fact, when packets are dropped, they will be retransmitted, possibly even increasing the amount of data sent through each router. Increasing the amount of data sent in response to congestion can cause an instability in the Internet known as congestion collapse. In congestion collapse, the equipment works fine, but the fraction of data that makes it to its final destination becomes small, so the effective throughput of the network is suddenly smaller. The problem comes and goes unpredictably, and is hard to pinpoint since all the hardware and software in the network are actually working as designed -- the problem is with the end-systems reacting inappropriately to congestion in the network.

The appropriate behavior for an end-system that is experiencing congestion is to dramatically reduce the sending rate. This is true whether or not the system is the major contributor to the congestion -- if all senders dramatically reduce their sending rate, the congestion will disappear. So a sender must be able to do two things: detect congestion, and reduce its send rate. These are treated in Sections 5.1 and 5.2.

5.1 Detecting Congestion

The simplest way to detect congestion would be for a router to tell a sender to slow down. After all, the router knows it is congested, and all it has to do is tell the senders. This is exactly the purpose of ICMP source quench packets, as described in RFC 792. Alternately, it is possible to imagine a congested router modifying ACKs for flows that are congesting the router. For example, a router could reduce the size of the flow control window, causing the senders to slow down.

The problem with all these schemes is that they require routers to do additional work, at the very time that the router is already working as fast as it can. A router is optimized for forwarding packets, and may not have the resources to send additional messages as fast as required in case of congestion. The source quench mechanism even requires a router to send additional packets, which is unwise if the entire network is suffering congestion. The reverse-route mechanism can fail if ACKs do not follow the same route as data messages, and also must match packets going in one direction against packets in a queue going the other direction, a process that is likely to require substantial resources.

An alternative mechanism, not implemented in the Internet but used in other networks, is to have flow control between each sender and the next-hop router, and between each router and its next hop. This can work, but is only really effective if routers do keep track of flows and utilization -- that is, if routers are connection-oriented. As we saw in Section 2, IP is designed to be connectionless for a variety of very good reasons, so this sort of connection-oriented congestion control is not very practical in the Internet.

Finally, there is a relatively recent mechanism in the Internet known as Explicit Congestion Notification, ECN, defined in RFC 3168. With this mechanism, a sender that is ECN-capable uses the two least-significant bits of the Type of Service field (Differentiated Services field). These two bits have a value of 00 if ECN is not supported, 01 or 10 (usually 10, also known as ECT(0)) if ECN is supported for this packet and this packet has not yet encountered congestion, and 11 if this packet has encountered congestion. The recipient of an IP packet sent by a sender that is interested in supporting ECN can therefore know whether a packet experienced congestion, as long as the congested router(s) is/are marking ECN-capable packets.

There are also ways that the end-systems can detect congestion directly. When congestion is occurring packets are dropped, and a reliable protocol can detect this. There are many possible causes for packet drop, but in a properly functioning network, congestion is the most likely. Also, if the reaction to congestion is to slow down, then certainly it is not unreasonable to react to packet drops as if they were all due to congestion, since the only disadvantage to doing so is a temporary reduction in performance. This is the strategy taken by the earliest TCP standard for congestion control, TCP Reno.

Packet drop shows that congestion is already occurring. However, it is possible to detect congestion before it occurs by measuring the round-trip time of packets. Again, a retransmission system can keep track of this by matching acknowledgments against the send time of a packet (ignoring retransmitted packets in this calculation). If the average round-trip time is on the increase, it is likely that congestion is beginning to occur, and that a reduction in the sending rate could prevent this. Because the congestion has not yet occurred, it may be that we will not have to reduce the sending rate as much as we would if we were reacting to a packet drop. Although this seems like a good idea, it has not been adopted as a TCP standard.

5.2 Reacting to Congestion

The congestion control mechanism in TCP Reno, described in RFC 1122, has the sender do all the work of avoiding congestion. The sender keeps track of the window it receives in the acks. This window is the flow control window described in Section 4. The sender must also keep a different window called the congestion window. At any given time, the sender must send no more bytes than the smaller of these two windows.

In explaining the algorithm for computing an updated congestion window, MSS is the maximum segment size that a TCP can send in one packet. The congestion window starts at 2 MSS, with the connection in the slow start state, and the following algorithm is used.

  1. If we get an ack newly acking N bytes, then
     A. if we are in slow start, increase the congestion window by min(N, MSS); once the congestion window reaches the threshold (set in step 2), leave slow start and enter congestion avoidance.
     B. if we are in congestion avoidance, increase the congestion window by MSS * MSS / cwnd, so that the window grows by about one MSS for each full window's worth of data acknowledged.
  2. if we get a timeout, indicating packet or acknowledgement loss, then set the threshold to 1/2 the current congestion window, set the congestion window to 1 MSS, and enter slow start.

To understand this algorithm, we have to first realize that it works in cycles of round-trip time (RTT). Each RTT, the sender is allowed to send a full window worth of data. It may not send a full window's worth of data, but this is the largest possible amount. Also, the send window (flow control window) may be smaller than the congestion window, but the worst case is that the sender will send a full congestion window worth of data each RTT.

With this understanding, let us look closely at the algorithm. If the state is slow start, each acknowledgment we get increases the window by 1 packet, that is, each acknowledgement for 1 packet allows us to send 2 packets. That means that if we send a full window's worth of data in each RTT, the window will double each RTT. This is exponential growth, and hardly what one thinks of in terms of "slow" start. Slow start is only slow when compared to the practice of sending a full window's worth of data immediately as fast as possible, which was normal before congestion control was implemented.

Eventually, as we double the congestion window size, we may cause or otherwise experience congestion and one or more packets will be dropped. In this case we go to part 2, which tells us to reduce the congestion window size back to 1 packet and start slow start again. This time, however, we remember the rate at which we experienced congestion, and only allow slow start to reach half of that rate -- the threshold. Once the window reaches the threshold, we switch to part 1.B of the algorithm, which increases the window size much more slowly. If a full window's worth of packets is sent in each RTT, then in each RTT the congestion window grows by one packet. This is linear growth, and continues until once again congestion is experienced.
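The growth pattern just described can be simulated round trip by round trip (Python; a sketch that updates once per RTT, whereas a real TCP updates on every ack, and with illustrative variable names):

```python
MSS = 1460  # maximum segment size, in bytes

def next_window(cwnd, ssthresh, loss):
    """One RTT of the Reno-style algorithm: exponential growth below
    the threshold, linear growth above it, collapse on a timeout."""
    if loss:
        return MSS, cwnd // 2                     # back to 1 MSS; threshold halved
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh), ssthresh  # slow start: double per RTT
    return cwnd + MSS, ssthresh                   # congestion avoidance: +1 MSS per RTT

cwnd, ssthresh = 2 * MSS, 64 * 1024
for _ in range(6):                                # six loss-free round trips
    cwnd, ssthresh = next_window(cwnd, ssthresh, loss=False)
print(cwnd)   # past the threshold by now, growing linearly
```

Feeding loss=True at any point drops the window back to one MSS and halves the threshold, restarting the cycle.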

We can actually observe congestion control in action by monitoring a transfer of a large amount of data over a slow link. The data rate will change constantly, as the sending TCP continually searches for a sustainable transmission rate. When a packet is lost, TCP slows down. When packets are all being delivered, TCP speeds up.

This basic algorithm is known as TCP Reno.

Since the early 2000s, a variant of TCP Reno called Cubic has become popular to optimize transmission over networks with high bandwidth-delay product. The main difference between Reno and Cubic is in the algorithm for setting the window size. Instead of incrementally setting the window size based on acks received, the congestion window is set to a cubic function of the time since the last loss event. At the loss event, the window size Wloss is recorded. The cubic function is designed to increase quickly soon after the loss event, similarly to the exponential growth of slow start. The growth of the cubic function slows as the window size approaches Wloss. As the window grows past Wloss, the window size can again grow more and more quickly.

Comparing Cubic and Reno, note that, given a fixed maximum bandwidth that a sender can send and ignoring the slow start phase, Reno reaches that bandwidth, then has to reduce its sending speed to half that bandwidth, so that on average, Reno only sends at 3/4 of the link bandwidth. Cubic, in contrast, spends most of its time increasing the window size slowly near the link bandwidth, giving a higher average data rate.

Further, the window in Cubic grows independently of round-trip time, whereas TCP Reno lets the window grow faster if the RTT is shorter. As a result, Cubic is considered fairer as well as faster.

Over the years TCP Reno and its intellectual heirs, including Cubic, have been refined in a number of ways. The first refinement says that if three duplicate ACKs are received, TCP should not wait for a timeout, but should immediately retransmit the segment following the one being acknowledged. The reasoning is that the receiver will transmit duplicate acks if it receives out-of-order packets, and this serves as a form of negative acknowledgement, or NAK (NACK), indicating that the segment right after the one being acknowledged has gone missing. This is called fast retransmit, and is only effective when a large amount of data is being sent, so that at least three out-of-order packets are received following the missing segment. In case of fast retransmit, the congestion window is halved, and the linear increase begun immediately, without slow start. This is called fast recovery, and is also the strategy to be used when an Explicit Congestion Notification is received from the peer.

When an IP implementation receives a packet marked as having experienced congestion, it informs the transport protocol above it. It is then the responsibility of the higher layer to respond appropriately. For TCP, this means setting the ECN-Echo (ECE) bit in the TCP header of every subsequent ack, until a TCP segment is received with the Congestion Window Reduced (CWR) bit set. This bit is only set in one segment each time a sender reduces its congestion window to half the previous value. If that segment is lost, the sender will reduce its congestion window and retransmit a new segment, again with the CWR bit set. TCP should reduce its congestion window at most once per Round-Trip Time (RTT).
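The echo protocol just described amounts to a small piece of state on the receiver side (a Python sketch; the event names are mine):

```python
def ece_bits(events):
    """For each arriving segment, report whether the next ack should
    carry the ECE bit: start echoing after a Congestion Experienced
    ('CE') mark, and stop once a segment arrives with CWR set."""
    echoing, result = False, []
    for event in events:
        if event == "CE":
            echoing = True
        elif event == "CWR":
            echoing = False
        result.append(echoing)
    return result

print(ece_bits(["data", "CE", "data", "CWR", "data"]))
# [False, True, True, False, False]
```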

To negotiate use of ECN, the initial SYN packet should have both ECE and CWR set, and the responding SYN+ACK should have only ECE set. If both of these hold, then the TCP peers may use ECN in this connection, and each of them must respond appropriately to congestion notifications received from the other side.

While it is arguable whether the congestion detection mechanism of TCP is the best it could be, there is evidence that the congestion control mechanism of TCP Reno and Cubic is very appropriate for guaranteeing stability of the Internet. Ignoring slow start, TCP Reno increases the window (and therefore the sending rate) linearly when there is no congestion, and decreases the window by a factor of two when congestion is detected. This is summarized by saying that the TCP sending rate has an additive increase (a constant is added to the window each RTT), and a multiplicative decrease. Any mechanism that has an increase more aggressive than additive increase or a decrease less aggressive than multiplicative decrease can be shown to cause congestive collapse, that is, prolonged periods of congestion where little useful data is delivered.

Similar to Reno, Cubic decreases the window size very quickly. The increase is relatively slow, especially as the window size approaches Wloss. So while not literally AIMD, Cubic has the behavior that we would expect from a stable congestion control algorithm.

There are many protocols that do not provide additive increase and multiplicative decrease. Many current real-time protocols are one example, and UDP is another. Redesigning some of these protocols so they do not cause congestion collapse in the Internet, while still providing useful levels of service, is an area of active research.

Since TCP reacts to packet loss by slowing down, it would be possible for routers to discard a few packets before congestion occurs, thereby requesting that senders slow down. This strategy is known as Random Early Discard or RED. Discards are random so that the router does not have to expend resources trying to identify the flows that are congesting the queue the most -- a random selection is most likely to discard packets from flows contributing the most packets to the queue. Random Early Discard works also with Explicit Congestion Notification, marking packets as having experienced congestion rather than dropping them.

6. Signaling: Connection Management

We have seen that each system using TCP to communicate must maintain some state about the communication. At the very least, this state must record the sequence numbers in use in both directions of the communication, the window sizes, and any packets that have not yet been acknowledged and may therefore have to be retransmitted. This state is also referred to as a connection. Originally, a connection referred to a dedicated pair of telephone conductors used for voice communication. Because TCP streams data in ways analogous to a telephone connection, we think of a TCP communication as occurring over a connection, though in fact there is no dedicated set of wires -- TCP communicates over the connectionless IP, which normally communicates over shared links. So the only expression of a connection in TCP is the state that each endpoint must maintain.

TCP assumes that this state is stored in a block of memory called a Transmission Control Block, or TCB. A block of memory may sound mysterious, but most languages provide operations to create and manipulate blocks of memory -- in C we call them structs, in object-oriented languages we call them objects. TCP expects that a separate TCB will be allocated for each end of a connection. For example, if I have established 10 TCP connections, I must have 10 TCBs. My peers in those connections must also have a total of 10 TCBs stored on their own machines. In the Unix world such an endpoint of communication is called a socket, and the term has been adopted outside the Unix world.

In order to manage a TCB, the peers must agree on:

  1. when to create the TCB
  2. when to de-allocate the TCB

Interestingly, these two operations also correspond to the major phase transitions for connections: establishing the connection (when the TCB is created) and closing the connection (after which the TCB can be de-allocated).

All the information needed to initialize a peer's TCB is included in every TCP header. This header includes the sequence number of the packet, an acknowledgement number and a window size, and bits to report the state of the connection, including the SYN, ACK, FIN, and RST bits discussed below.

The exchange of information to create or take down a connection is known as signaling. The term originated in the world of Plain Old Telephone System (POTS) technology, where central offices signal to each other to establish a literal connection, and is less commonly used in the context of the Internet.

The client normally initiates a connection (corresponding to the connect system call) by selecting an initial sequence number and sending a SYN packet to the server. This initial packet is the only one sent where the ACK bit is clear (i.e. 0, not set). The initial sequence number is selected so as to try to avoid accidentally matching any sequence numbers that may have been used recently for prior connections.

A server receiving such a packet checks to make sure it has a socket in the listen state, and accepts the connection, meaning it creates a TCB to correspond to this new socket and allows a corresponding call to accept to complete. The server will also create a response packet with both the SYN and ACK bits set. The sequence number is selected as above, and the ACK number is the received sequence number, plus one. For example, suppose I receive a TCP SYN packet with sequence number 100000. I select my own sequence number to be 7654321, and send a SYN+ACK packet with sequence number 7654321, ack number 100001. The client will acknowledge my SYN+ACK with an ack packet (a packet with only the header, and no data) with sequence number 100001 and ack number 7654322. The next packet I will send (unless retransmission is necessary) will carry sequence number 7654322. It was mentioned above that TCP sequence numbers count bytes, and so this increment in sequence numbers (when sending SYN and FIN bits) is equivalent, in some way, to sending a byte. The designers of TCP realized that they wanted to be able to acknowledge SYN and FIN bits, and so assigned their own sequence number to each of these bits (each of which is sent exactly once per connection).

In other words, the first sequence number in a connection is reserved for the SYN bit, the last for the FIN bit, and every other sequence number in that connection corresponds to a byte of data.

In summary, if I am the client, I first send a SYN packet carrying the initial sender sequence number, or ISS. I then wait for a SYN+ACK packet carrying acknowledgement number ISS+1 and the initial receiver sequence number, IRS. I respond with an ACK packet containing sequence number ISS+1 and acknowledgement number IRS+1. This is known as the TCP three-way handshake, and is the most common way of establishing TCP connections.
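The sequence-number bookkeeping of the handshake can be written out directly (Python; a sketch with illustrative names, where the SYN on each side consumes one sequence number):

```python
def three_way_handshake(iss, irs):
    """Seq/ack numbers of the three handshake packets, given the
    client's initial sequence number (ISS) and the server's (IRS)."""
    syn     = {"seq": iss,     "ack": None}     # client -> server (no ACK bit)
    syn_ack = {"seq": irs,     "ack": iss + 1}  # server -> client
    ack     = {"seq": iss + 1, "ack": irs + 1}  # client -> server
    return syn, syn_ack, ack

# The example from the text: ISS 100000, IRS 7654321.
for packet in three_way_handshake(100000, 7654321):
    print(packet)
```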

The three-way handshake for connection establishment is shown on the left in the following picture.

The right side of the picture shows one of the possible scenarios for closing the connection, discussed in detail below.

TCP also supports other ways of establishing connections. Of purely academic interest is the TCP simultaneous connection, where both sides send SYN packets, and receive ACK packets in return. To capture all possible legal interactions, TCP defines a state machine showing what packets are sent and what state transitions occur as a result of receiving specific packets or of timers expiring. This state machine is defined in detail in RFC 793, with minor changes in RFC 1122, and the overview of the state machine from RFC 793 is reproduced here.

                              +---------+ ---------\      active OPEN  
                              |  CLOSED |            \    -----------  
                              +---------+<---------\   \   create TCB  
                                |     ^              \   \  snd SYN    
                   passive OPEN |     |   CLOSE        \   \           
                   ------------ |     | ----------       \   \         
                    create TCB  |     | delete TCB         \   \       
                                V     |                      \   \     
                              +---------+            CLOSE    |    \   
                              |  LISTEN |          ---------- |     |  
                              +---------+          delete TCB |     |  
                   rcv SYN      |     |     SEND              |     |  
                  -----------   |     |    -------            |     V  
 +---------+      snd SYN,ACK  /       \   snd SYN          +---------+
 |         |<-----------------           ------------------>|         |
 |   SYN   |                    rcv SYN                     |   SYN   |
 |   RCVD  |<-----------------------------------------------|   SENT  |
 |         |                    snd ACK                     |         |
 |         |------------------           -------------------|         |
 +---------+   rcv ACK of SYN  \       /  rcv SYN,ACK       +---------+
   |           --------------   |     |   -----------                  
   |                  x         |     |     snd ACK                    
   |                            V     V                                
   |  CLOSE                   +---------+                              
   | -------                  |  ESTAB  |                              
   | snd FIN                  +---------+                              
   |                   CLOSE    |     |    rcv FIN                     
   V                  -------   |     |    -------                     
 +---------+          snd FIN  /       \   snd ACK          +---------+
 |  FIN    |<-----------------           ------------------>|  CLOSE  |
 | WAIT-1  |------------------                              |   WAIT  |
 +---------+          rcv FIN  \                            +---------+
   | rcv ACK of FIN   -------   |                            CLOSE  |  
   | --------------   snd ACK   |                           ------- |  
   V        x                   V                           snd FIN V  
 +---------+                  +---------+                   +---------+
 |FINWAIT-2|                  | CLOSING |                   | LAST-ACK|
 +---------+                  +---------+                   +---------+
   |                rcv ACK of FIN |                 rcv ACK of FIN |  
   |  rcv FIN       -------------- |    Timeout=2MSL -------------- |  
   |  -------              x       V    ------------        x       V  
    \ snd ACK                 +---------+delete TCB         +---------+
     ------------------------>|TIME WAIT|------------------>| CLOSED  |
                              +---------+                   +---------+

In this state machine, a connection begins in state CLOSED and moves to the ESTABlished state via the three-way handshake on either the passive (server) or active (client) paths.

Once the communication is complete, the application (or the operating system) calls CLOSE, and the connection starts to move back to the CLOSED state, at which point the TCB can be deallocated. Closing requires each peer to send a packet containing the FIN bit. No new data can be sent after such a packet (although old data can be retransmitted as needed), so the FIN bit can be interpreted as a promise not to send any more data. After receiving and acknowledging a FIN packet, however, a system can still send new valid data -- in this case, the connection is considered half-open or half-closed, since data can still be transferred in one direction.

When closing, one of the two systems necessarily ends up in the TIME WAIT state, where the connection is kept alive for twice the maximum segment lifetime (MSL). The system that has to wait is the one that has sent the last acknowledgement. This is done to solve the problem of the last ack: how does the sender of the last ack know that the last ack has been received? Suppose the sender of the last ack did not wait. Then, if the ack were lost, the other system would retransmit the FIN packet, which we would lack the information to acknowledge properly, since the TCB would have been deallocated. We could change the design of TCP to acknowledge the final ACK, but then the last ack problem is simply shifted to the peer, who now has to keep enough state to properly acknowledge the ACK to the FIN. As an alternative, we might simply discard the TCB and send a nonsense response to a duplicate FIN, but that would fail to guarantee to the peer that all the data has been correctly received. The best solution is to keep the connection alive for a while, as TCP does. The maximum segment lifetime is usually taken to be 1-2 minutes, so the TIME WAIT state usually persists for 2-4 minutes.

Unfortunately, sometimes this means that a port is not available for a new socket for a while after the program has closed. If the client closes first, as is usually the case with HTTP, then the server does not need to go into the time-wait state. This is only relevant if the server binds to the port over and over again, which is unusual with most servers, but fairly common when you are testing code.

Finally, TCP provides another way of closing the connection. If I set the reset (RST) bit in any packet I send to my peer, I am communicating that some packet I have received is not compatible with my TCB or that I don't have an appropriate TCB for the packet. For example, a system may send a reset packet in response to a SYN that does not match any of the system's currently running servers. A connection can also be reset by a system that has rebooted, when receiving packets from old connections whose state has been lost.

Although resetting a connection is an effective way of closing the connection, closing confirms to both parties that all the data sent has been received, and is therefore preferable under most circumstances.

7. TCP, the Transmission Control Protocol

The major functions of TCP have been described above. These include reliable delivery through acknowledgements and retransmission, flow control, congestion control, and connection management (signaling).

Some perspective on the development of TCP is also given by this article: A brief history of the Internet, by Barry M. Leiner, Vinton G. Cerf, David D. Clark, Robert E. Kahn, Leonard Kleinrock, Daniel C. Lynch, Jon Postel, Larry G. Roberts, and Stephen Wolff. This is recommended reading.

TCP is described by RFC 793. An update to all major Internet protocols was issued in 1989 as RFC 1122, which includes additions to TCP such as congestion control and the Nagle algorithm.

The only other major function of TCP that we have not yet mentioned is demultiplexing. The TCP header carries two 16-bit port numbers used to identify the socket that a packet belongs to. The first is the source port, identifying the sender of the packet; the second is the destination port, identifying the receiver. Programmers who know that a server must be bound to a port are sometimes unaware that both sockets must be bound to a port, though most systems will automatically select an available port if none is specified by the application. In common usage today, a well-known port is associated with a specific service: for example, port 80 is commonly bound to by HTTP servers, and port 22 by SSH servers. Most such servers can also be configured to run on different ports, which works fine as long as the corresponding client is configured to contact the servers on those ports.
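The fact that both sockets have ports, with the client's selected automatically, can be seen in a short sketch (Python sockets on the loopback address; all port numbers here are picked by the system):

```python
import socket

# Sketch: both sockets of a TCP connection are bound to a port.  The
# server binds explicitly; the client's port is picked by the system
# when it connects.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # port 0: any free port
server.listen(1)
server_port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", server_port))
client_port = client.getsockname()[1]  # automatically selected source port

conn, peer = server.accept()           # peer == (client address, client port)
client.close(); conn.close(); server.close()
```

The client never calls bind, yet its packets carry a valid source port, which the server sees as the peer's port.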

The TCP header, also from RFC 793, is shown below.

    0                   1                   2                   3   
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Acknowledgment Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |       |C|E|U|A|P|R|S|F|                               |
   | Offset| Rsrvd |W|C|R|C|S|S|Y|I|            Window             |
   |       |       |R|E|G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Most of the header will be self-explanatory from the above discussion, including the two 16-bit port numbers, the 32-bit sequence and acknowledgement numbers, the 16-bit window and checksum, and the SYN, FIN, ACK, and RST bits. Two remaining bits, URG and PSH, and the data offset and urgent pointer, are discussed briefly in the remainder of this section. The ECE and CWR bits are used for congestion control and so were discussed in Section 5.

The data offset is analogous to the IP header length, being a 4-bit integer recording the number of 32-bit words in the TCP header. This allows for options to be carried in the TCP header. Because it counts whole 32-bit words, the TCP header size must be a multiple of 4 bytes, which means the options must also add up to a multiple of 4 bytes. Since some options may not be a multiple of 4 bytes, padding should be used when necessary to extend the header to maintain this requirement.
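As a sketch of how the data offset is read, the following builds an option-free header and extracts the field (the port, sequence, and window values are made up for illustration):

```python
import struct

# Sketch: extract the 4-bit data offset from a raw TCP header and turn
# it into a byte count.  The header below carries no options, so the
# data offset is 5 words (20 bytes).
header = struct.pack("!HHIIHHHH",
                     12345, 80,        # source and destination ports
                     1000, 2000,       # sequence and acknowledgement numbers
                     (5 << 12) | 0x10, # data offset = 5 words, ACK flag set
                     65535, 0, 0)      # window, checksum, urgent pointer

offset_and_flags = struct.unpack("!H", header[12:14])[0]
data_offset_words = offset_and_flags >> 12   # upper 4 bits of that word
header_length = data_offset_words * 4        # 5 words -> 20 bytes
```

The payload then begins at byte offset header_length within the segment.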

One popular TCP option is the Maximum Segment Size (MSS) option, in which the sender describes to the receiver the maximum segment size it is willing to receive, if it is different from the default of 536 bytes (536 bytes of data, plus the header). The MSS option, if sent, must be sent with the initial SYN packet in each direction. In theory, a system should only send segments larger than 536 bytes if it receives such an option. Common practice is to send the largest segment the MTU allows on local networks (i.e. where the network number is the same), and at most 536 bytes of data to other networks. Alternatively, a system can use IP Path MTU discovery, described by RFC 1191 for IPv4 and RFC 1981 for IPv6, which reports the maximum MTU supported by the routers in the path. While in theory a final receiver might be unwilling to handle an MTU-sized packet, that is not observed in practice, and so path MTU discovery is a reasonable way to determine the MSS.
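The arithmetic behind these numbers is simple: the 536-byte default comes from the 576-byte datagram that every host must be able to accept, minus 20 bytes each of IP and TCP header, and a host on an Ethernet would compute its advertised MSS the same way from the MTU (assuming no IP or TCP options):

```python
# Sketch: deriving the MSS from the MTU, assuming minimal (20-byte)
# IP and TCP headers with no options.
ip_header = 20
tcp_header = 20

default_mss = 576 - ip_header - tcp_header    # 536: from the 576-byte minimum
ethernet_mss = 1500 - ip_header - tcp_header  # 1460: typical on Ethernet LANs
```

Any options present in either header reduce the data that fits in one MTU-sized packet accordingly.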

Another set of TCP header options allows the use of bigger windows and bigger sequence number spaces. As seen above, the TCP window is limited for two reasons: to avoid overflowing the receiver, and to avoid storing more data in the network than the network is willing to hold. Some networks and receivers work best, however, with really large windows. To consider this, think of the bandwidth of a network: this is the number of bits/second that the network can transmit. Now also consider the delay of a network: this is the time from the initial transmission of a unit of data to its successful receipt. If the bandwidth-delay product of a network is large, then when such a network is working correctly, it naturally stores large amounts of data between the sender and the receiver. One example of a network with large bandwidth-delay product is a satellite-based network, which might carry megabits/second and have a delay in the hundreds of milliseconds. The amount of data stored is exactly the bandwidth-delay product. The window is the maximum amount of unacknowledged data, that is, the maximum amount of data the sender may store in the network. Any window smaller than the bandwidth-delay product will therefore constrain the sender to send less than the network can transport, leading to inefficient use of the available channel. As a result, the authors of RFC 1106 have proposed a number of options to increase both the window size and the sequence number space.

The delay component of the bandwidth-delay product is the one-way delay if one is considering the amount of data actually stored by the transport network, but is the round-trip delay (RTT) if one is considering the window size, since the sender cannot start sending again until the ack has been returned to the sender.
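A quick calculation shows why the basic 16-bit window is a problem on such a path (the satellite figures below are assumptions for illustration):

```python
# Sketch: bandwidth-delay product of an assumed satellite path, compared
# with the largest window the basic 16-bit window field can advertise.
bandwidth_bps = 2_000_000              # assumed: 2 Mbit/s
rtt = 0.5                              # assumed: 500 ms round-trip time

bdp_bytes = int(bandwidth_bps * rtt / 8)   # data the sender must keep in flight
max_basic_window = 2**16 - 1               # 65,535 bytes

# With the window as the bottleneck, throughput is capped at window/RTT:
capped_bps = max_basic_window * 8 / rtt    # about half the available bandwidth
```

Here the bandwidth-delay product is 125,000 bytes, nearly double what a 16-bit window can cover, so without larger windows the sender idles waiting for acks.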

The PSH bit is supposed to tell the receiver that a logical unit of data has been received, and that all accumulated data should now be "pushed" up to the application. The application is not normally aware that a push bit has been received, and a segment can carry only a single push bit, so if data from several logical units is combined within one segment, the boundaries between them are lost. These facts limit the usefulness of the push bit.

An urgent byte of data is a byte that is supposed to be of particular interest to the application, such that the application should be notified of it even if it has not yet consumed the preceding bytes. This could be, for example, a Control-C or other interrupt character in an interactive terminal session. If such an urgent byte is sent, the urgent pointer is set to the position of this byte within the segment (0 if the urgent byte is the first byte in the segment, 1 if it is the second byte, and so on), and the URG bit is set. Like PSH, this mechanism suffers from the inability to mark more than one urgent byte per segment, and, apart from marking interrupt characters in terminal sessions, it is not widely used.

8. Other Transport Layer Protocols: UDP, Real-Time

8.1 UDP

After studying TCP, UDP, the User Datagram Protocol, described in RFC 768, seems very simple. The 8-byte header from RFC 768 follows.
 0      7 8     15 16    23 24    31  
+--------+--------+--------+--------+ 
|     Source      |   Destination   | 
|      Port       |      Port       | 
+--------+--------+--------+--------+ 
|                 |                 | 
|     Length      |    Checksum     | 
+--------+--------+--------+--------+ 
|                                     
|          data octets ...            
+---------------- ...                 

Most of the fields in this header have the same function as the analogous fields in TCP. The checksum field is set to zero to record that no checksum was computed (the checksum is mandatory for TCP, but optional for UDP over IPv4). The length field is the total length of the UDP header and payload; it is redundant given the total_length field in IP, and was probably included to make the header size a multiple of 4 bytes.
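When the checksum is computed, TCP and UDP use the same Internet checksum algorithm (the one's complement of the one's-complement sum of 16-bit words, described in RFC 1071), taken over the pseudo-header, the transport header, and the data. A sketch of the computation:

```python
import struct

# Sketch: the Internet checksum used by both TCP and UDP -- the one's
# complement of the one's-complement sum of the data taken as 16-bit
# big-endian words.
def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"               # pad odd-length data with a zero byte
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total > 0xFFFF:             # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return (~total) & 0xFFFF
```

A useful property of this checksum is that recomputing it over data with the correct checksum inserted yields zero, which is how a receiver verifies it.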

Unlike TCP, UDP is connectionless. UDP can be seen as providing an IP-like service to applications by supplementing the IP functions with demultiplexing (the port numbers) and the optional checksum.
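A sketch of this connectionless style (Python datagram sockets on the loopback interface; the port is chosen by the system):

```python
import socket

# Sketch: UDP requires no connection setup -- a datagram can be sent to
# any (address, port) at any time, and each recvfrom reports its sender.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))        # port 0: any free port
receiver.settimeout(2.0)               # don't block forever if the datagram is lost
addr = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"hello", addr)          # no connect() needed

data, source = receiver.recvfrom(1024)
sender.close(); receiver.close()
```

There is no handshake and no acknowledgement: if the datagram is lost, neither side is told, which is exactly the IP-like service described above.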

8.2 Real-Time Protocols: RTP, RTCP, and RTSP

In all of the above protocols, correctness of the data is paramount. In TCP, it is essential that all the data sent be delivered, or, if that is not possible, that the sender learn that delivery could not be confirmed.

In a real-time protocol, correctness of the data may not be essential. Consider for example a video stream. If some of the pixels in a frame are corrupted, they will show up as random noise in the image. However, a given image is only displayed for about 1/60th of a second, so lack of correctness simply produces a loss in quality. The same is true for loss of an entire frame -- if there are significant losses, the images do not flow smoothly from one to the next, yet the video stream may still be useful, intelligible, and interesting. Guaranteeing correctness, by retransmitting and thus delaying the entire stream, may be less acceptable than the occasional data losses we have considered. This makes video streaming a real-time application, in which late delivery is usually worse than timely delivery with occasional losses or mistakes. There are many real-time applications, including audio delivery and Voice over IP (VoIP), as well as remote control of physical systems.

Many people implementing real-time applications have used UDP, since UDP packets are never retransmitted and do not suffer from slowdowns caused by flow control or congestion control (note that flow control and congestion control should still happen, and the design of a congestion control mechanism for real-time flows equivalent to the TCP congestion control is the subject of active research).

Applications do not have to use plain UDP, as there are a number of real-time transport protocols that have been designed for the Internet. These will only be considered briefly here, with the interested reader encouraged to consult the relevant RFCs. These protocols are generally layered over UDP, using well-known ports to determine which transport protocol is being used. Real-time transport is generally unidirectional, though if two-way transport is needed (as in VoIP), it can be provided by setting up two channels in opposite directions.

The Real-Time Transport Protocol, RTP, is described in RFC 1889 and RFC 1890. RTP is designed as a foundation for a class of real-time transport protocols, providing at a minimum a source identifier, timestamp, and sequence number for each packet. The meaning of the timestamp and of the data following the minimum header depends on a further specification, called a profile, which varies from application to application -- RFC 1890 provides such a profile for audio and video conferences.
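The fixed part of the RTP header is 12 bytes; a sketch of parsing it (field layout per RFC 1889, with the function name chosen here for illustration):

```python
import struct

# Sketch: parsing the 12-byte fixed RTP header -- version, payload type,
# sequence number, timestamp, and source (SSRC) identifier.
def parse_rtp_header(packet: bytes) -> dict:
    first, marker_and_pt, seq = struct.unpack("!BBH", packet[:4])
    timestamp, ssrc = struct.unpack("!II", packet[4:12])
    return {
        "version": first >> 6,                 # RTP version (2)
        "payload_type": marker_and_pt & 0x7F,  # interpreted via the profile
        "sequence": seq,                       # for loss and reorder detection
        "timestamp": timestamp,                # units depend on the profile
        "ssrc": ssrc,                          # identifies the source
    }
```

Note that the timestamp and payload type are opaque at this layer: only the profile in use gives them meaning.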

RTP also provides a control protocol, RTCP, which can be used to monitor the quality of service delivered to the recipients. RTCP was carefully designed to work well over multicast, where the quality experienced by some receivers may differ substantially from the quality experienced by other receivers, and where simply sending all the feedback data back to the source could easily overwhelm the source.

The Real Time Streaming Protocol, RTSP, ( RFC 2326), can be used to synchronize a number of independent real-time data streams, for example, audio and video.

9. Summary: What are the principles?

  1. It may be better to have the sender do all or most of the work of flow control, congestion control, silly window avoidance, combining segments, and the like. The sender knows how much data it has to send, and has more information about round-trip time and packet drops. Also, a fast sender could overwhelm a slow receiver, but the opposite is not true.
  2. When designing for reliability, it is important to be aware of what can go wrong. An example of this is the inclusion of the pseudo-header in the checksum computation, accounting for inadvertent misdelivery.
  3. TCP provides end-to-end services such as congestion control and error recovery. These services can also be provided on a hop-by-hop basis. The end-to-end argument states that such services MUST be provided on an end-to-end basis in order for the network to work well, but providing them on a hop-by-hop basis as well might help the overall performance, for example by avoiding the relatively slow TCP retransmission whenever possible. One reason the end-to-end provision is mandatory is that the network is complex and may fail unpredictably and without notice, whereas the endpoints themselves are in a good position to verify whether the end-to-end service has succeeded.
  4. The original TCP had no congestion control and no adaptive timers. The Internet, although a designed and engineered artifact, is not completely predictable, and interesting, unplanned-for phenomena such as congestion collapse can sometimes be observed.
  5. The designers of TCP made little effort to preserve segment boundaries, but many programmers, especially beginning programmers, write code that assumes each receive returns exactly the data from one send. This is a particularly glaring mismatch between specification and expectations, something that is very hard to avoid in protocol design, although such mismatches should be avoided whenever possible.

10. Peer-to-Peer (P2P) networks

If you see this note, please remind the instructor that this section is incomplete.