TCP
- TCP streams and push
- TCP header
- tcpdump and wireshark
partial reminder: TCP window scaling
- the window field in the TCP header is 16 bits, so the largest
window is 65,535 bytes
- this is not enough for full bandwidth on a gigabit Ethernet path
with a 100 ms RTT, which has a bandwidth-delay product of 100 Mb = 12.5 MB
- so TCP provides a window scaling option, sent with the
SYN packet
- the option only takes effect if both sides send a window scaling
option with their SYN packet
- window scaling is defined in
RFC 1323,
"TCP Extensions for High Performance", which also provides protection
against wrapped sequence numbers.
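The arithmetic above can be checked with a short sketch. The function names are illustrative only; the limit of 14 on the shift count comes from RFC 1323.

```python
MAX_UNSCALED_WINDOW = 65_535  # largest value of the 16-bit window field
MAX_SHIFT = 14                # largest shift count allowed by RFC 1323


def bandwidth_delay_product_bytes(bits_per_second, rtt_seconds):
    """Bytes that must be in flight to keep the path full for one RTT."""
    return bits_per_second * rtt_seconds / 8


def min_window_shift(bdp_bytes):
    """Smallest shift s such that 65,535 * 2**s covers bdp_bytes."""
    shift = 0
    while (MAX_UNSCALED_WINDOW << shift) < bdp_bytes and shift < MAX_SHIFT:
        shift += 1
    return shift


bdp = bandwidth_delay_product_bytes(1e9, 0.100)   # gigabit path, 100 ms RTT
shift = min_window_shift(bdp)
# Without scaling, at most one 65,535-byte window can be sent per RTT.
unscaled_bps = MAX_UNSCALED_WINDOW * 8 / 0.100
print(f"BDP = {bdp / 1e6:.1f} MB, shift needed = {shift}")
print(f"unscaled limit = {unscaled_bps / 1e6:.2f} Mb/s")
```

With these numbers the unscaled window limits the connection to a few Mb/s, which is why the option matters on long fast paths.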
TCP options
- TCP options may follow the basic header
- most TCP options are only sent with the SYN and SYN+ACK packets
- this picture shows options sent with a SYN
- and with the corresponding SYN-ACK
- can you guess the standard format for options?
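One answer to the question above: apart from the single-byte kinds 0 (end of option list) and 1 (no-operation), each option is a kind byte, a length byte that counts the whole option, and then the value. A minimal parser sketch follows; the function name and the example bytes are made up for illustration.

```python
def parse_tcp_options(raw: bytes):
    """Split the options area of a TCP header into (kind, value) pairs."""
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:            # End of Option List: stop parsing
            break
        if kind == 1:            # No-Operation: one byte of padding
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]      # counts the kind and length bytes too
        if length < 2 or i + length > len(raw):
            raise ValueError("bad option length")
        options.append((kind, raw[i + 2:i + length]))
        i += length
    return options


# Hand-built example: MSS (kind 2) = 1460, NOP, window scale (kind 3) shift 7
example = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 7])
for kind, value in parse_tcp_options(example):
    print(kind, int.from_bytes(value, "big"))
```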
TCP Streams and push
- TCP actually has a segmentation bit: PSH, or push
- when the application "pushes" the data, that information could be
conveyed all the way to the application at the other end
- if TCP can coalesce several user segments (each with PSH) into one
TCP segment, that TCP segment can only carry one PSH bit
- so passing PSH to the application is optional, and TCP has
no record boundaries
- so push is an advisory bit only: it encourages TCP (and the
application) to assume that the data must be sent (and presumably
replied to) before more data will be sent
TCP header
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Acknowledgment Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |       |C|E|U|A|P|R|S|F|                               |
| Offset|Reservd|W|C|R|C|S|S|Y|I|            Window             |
|       |       |R|E|G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |        Urgent Pointer         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             data                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
TCP Header Format
TCP Header fields
- Source and Destination port: demultiplexing
- Sequence and acknowledgement: reliable delivery
- Data Offset: header size, options
- Window: flow control
- Checksum: correctness
- Urgent Pointer: "special place" in the data stream
TCP Header bits
- SYN: I want to establish a connection
- FIN: I will never again send data on this connection
- RST: kill this connection
- PSH: immediate delivery of this data is probably a good idea
- URG: the urgent pointer is valid
- ACK: the acknowledgement field is valid (set in all but the
first SYN packet)
- ECE: this packet acknowledges a packet received with the IP
"congestion experienced" bit set
- CWR: the sender of this packet has reduced its congestion window
- the last two bits will be discussed in the context of congestion
control
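The fields and bits listed above can be pulled out of the fixed 20-byte header with Python's struct module. This is a sketch only: the function and dictionary key names are illustrative, and the sample SYN is hand-built with values borrowed from the ssh trace in the tcpdump example in these notes.

```python
import struct

# Flag names from the least significant bit (FIN) to the most significant (CWR).
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the 20-byte fixed part of a TCP header (network byte order)."""
    (src_port, dst_port, seq, ack, offset_byte, flag_byte,
     window, checksum, urgent) = struct.unpack("!HHIIBBHHH", segment[:20])
    header_len = (offset_byte >> 4) * 4    # Data Offset counts 32-bit words
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": header_len,
        "flags": [name for bit, name in enumerate(FLAG_NAMES)
                  if flag_byte & (1 << bit)],
        "window": window,
        "checksum": checksum,
        "urgent": urgent,
        "options": segment[20:header_len],  # empty when Data Offset is 5
    }


# Hand-built SYN: port 1022 to port 22, no options, checksum left as zero.
syn = struct.pack("!HHIIBBHHH", 1022, 22, 185741093, 0, 5 << 4, 0x02, 512, 0, 0)
print(parse_tcp_header(syn))
```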
tcpdump and wireshark
- tcpdump is a utility to look at all the packets on the network
and print out the headers
- usually run as root
- Wireshark is similar but (a) window-based, (b) newer
- Wireshark was forked from a project called Ethereal
- The pictures above, of the TCP options, are from Wireshark
16:41:58.905998 maru.ics.hawaii.edu.14407 >
volcano.telnet: S 2671654129:2671654129(0)
win 512 [tos 0x10]
16:41:59.115893 volcano.telnet >
maru.ics.hawaii.edu.14407: R 0:0(0)
ack 2671654130 win 0
tcpdump example
16:47:02.285753 maru.1022 > volcano.ssh:
S 185741093:185741093(0) win 512
16:47:02.495648 volcano.ssh > maru.1022:
S 3829593384:3829593384(0)
ack 185741094 win 16352
16:47:02.495648 maru.1022 > volcano.ssh:
. ack 1 win 32120 (DF)
16:47:07.183328 volcano.ssh > maru.1022:
P 1:16(15) ack 1 win 16352 (DF)
16:47:07.183328 maru.1022 > volcano.ssh:
P 1:16(15) ack 16 win 32120 (DF) [tos 0x10]
16:47:07.433203 volcano.ssh > maru.1022:
P 16:292(276) ack 16 win 16352 (DF)
16:47:08.502673 volcano.ssh > maru.1022:
P 16:292(276) ack 16 win 16352 (DF)
16:47:08.522663 maru.1022 > volcano.ssh:
. ack 292 win 32120 (DF) [tos 0x10]
transport layer
- demultiplexing
- microprotocol implementation
- congestion collapse
- congestion control: TCP Reno
Implementation and Demultiplexing
- a monolithic implementation of a networking protocol stack would
look at IP source and destination, protocol number, and TCP/UDP ports to
select a socket and TCB corresponding to the packet
- in a layered implementation:
- the IP layer uses source, destination, and protocol
to identify "connection"
- TCP/UDP (transport) layer uses source and destination port
to identify TCB/socket
- Microprotocol implementation:
- demux layer looks at protocol to choose TCP or UDP upper layer
- demux layer uses source IP to determine "connection" (corresponding
to all TCP or UDP packets from that source IP)
- check layer checks destination
- demux layer uses source port to determine "connection" (which includes
all packets from the given source IP and the given source port)
- demux layer uses destination port to finally determine socket
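The monolithic and microprotocol lookups above can be contrasted in a small sketch. Both classes, the table layouts, and the addresses are hypothetical; a real stack would also handle listening sockets and wildcards.

```python
TCP, UDP = 6, 17   # IP protocol numbers


class MonolithicDemux:
    """One lookup on the full 5-tuple."""

    def __init__(self):
        self.table = {}

    def register(self, proto, src_ip, dst_ip, src_port, dst_port, sock):
        self.table[(proto, src_ip, dst_ip, src_port, dst_port)] = sock

    def lookup(self, proto, src_ip, dst_ip, src_port, dst_port):
        return self.table.get((proto, src_ip, dst_ip, src_port, dst_port))


class MicroDemux:
    """A chain of small steps, each looking at one field."""

    def __init__(self, local_ip):
        self.local_ip = local_ip
        self.table = {}   # proto -> src_ip -> src_port -> dst_port -> socket

    def register(self, proto, src_ip, src_port, dst_port, sock):
        by_port = (self.table.setdefault(proto, {})
                             .setdefault(src_ip, {})
                             .setdefault(src_port, {}))
        by_port[dst_port] = sock

    def lookup(self, proto, src_ip, dst_ip, src_port, dst_port):
        by_src_ip = self.table.get(proto, {})          # demux: protocol
        by_src_port = by_src_ip.get(src_ip, {})        # demux: source IP
        if dst_ip != self.local_ip:                    # check: destination
            return None
        by_dst_port = by_src_port.get(src_port, {})    # demux: source port
        return by_dst_port.get(dst_port)               # demux: destination port


mono = MonolithicDemux()
micro = MicroDemux(local_ip="10.0.0.2")
mono.register(TCP, "10.0.0.1", "10.0.0.2", 1022, 22, "ssh-session")
micro.register(TCP, "10.0.0.1", 1022, 22, "ssh-session")
print(mono.lookup(TCP, "10.0.0.1", "10.0.0.2", 1022, 22))
print(micro.lookup(TCP, "10.0.0.1", "10.0.0.2", 1022, 22))
```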
Router Congestion
- assume a fast router
- two ethernet links receiving lots of outgoing data
- one (relatively) slow T-1 link (1.5 Mb/s) sending the outgoing data
- if the two links send more than 1.5 Mb/s over an extended period,
the router buffers begin to fill up
- eventually the router will have to discard data due to congestion:
more data being sent than the line can carry
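A toy model of the scenario above, in one-second steps. The buffer size and the arrival rate are made-up numbers; only the 1.5 Mb/s link rate comes from the notes.

```python
def simulate_queue(arrival_bits, link_bits, buffer_bits, seconds):
    """Return a (queued_bits, dropped_bits) pair for each one-second step."""
    queue = 0
    history = []
    for _ in range(seconds):
        queue += arrival_bits            # bits arriving from the two Ethernets
        queue -= min(queue, link_bits)   # bits the T-1 can carry this second
        dropped = max(0, queue - buffer_bits)
        queue -= dropped                 # overflow is discarded
        history.append((queue, dropped))
    return history


# 2 Mb/s arriving, 1.5 Mb/s leaving, room for 1 Mb in the router buffers
for second, (queued, dropped) in enumerate(
        simulate_queue(2_000_000, 1_500_000, 1_000_000, 4), start=1):
    print(second, queued, dropped)
```

In this model the queue grows by the excess 0.5 Mb each second until the buffer is full, after which the excess is dropped every second.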
Congestion Collapse
- assume a fixed timeout
- if I have n bytes/second to send, I send them
- if they get dropped, I retransmit them (total 2n bytes/second,
3n bytes/second, ...)
- when there is congestion, packets get dropped
- if everybody retransmits with fixed timeout, the amount of data
sent grows, increasing congestion
- eventually, very little data gets through, most is discarded
- the network is (nearly) down
TCP Reno
- exponential backoff: if retransmit timer was t before
retransmission, it becomes 2t after retransmission
- careful RTT measurements give retransmission as soon as possible,
but no sooner
- keep a congestion window:
- effective window is lesser of: (1) flow control window, and
(2) congestion window
- congestion window is kept only on the sender, and never communicated
between the peers
- congestion window (cwin) starts at 1 MSS, grows by 1 MSS for every MSS
acked: this is the exponential growth phase of the congestion window, called
slow start
- on a retransmission, thresh = cwin / 2, and cwin = 1 MSS
- then, use slow start while cwin < thresh
- then (after cwin >= thresh) for each ack, add to the
window the value MSS * newly-acked/window: this adds one MSS
to the window for each whole window that is acked (typically, once
every RTT) resulting in linear growth
- fast retransmit is similar -- interesting details at
RFC 2001.
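The congestion window rules above can be sketched as a small class. This follows the simplified rules in these notes rather than the full text of RFC 2001; the class name, the MSS value, and the initial threshold are assumptions, and fast retransmit is not modeled.

```python
MSS = 1460   # bytes; an assumed, typical value


class RenoWindow:
    def __init__(self, initial_thresh=64 * 1024):
        self.cwin = MSS                 # congestion window, in bytes
        self.thresh = initial_thresh    # slow start threshold, in bytes

    def on_ack(self, newly_acked):
        if self.cwin < self.thresh:
            # slow start: 1 MSS for every MSS acked (doubles each RTT)
            self.cwin += newly_acked
        else:
            # congestion avoidance: about 1 MSS per whole window acked
            self.cwin += MSS * newly_acked / self.cwin

    def on_retransmit_timeout(self):
        self.thresh = self.cwin / 2
        self.cwin = MSS

    def effective_window(self, flow_control_window):
        # the sender may not exceed either window
        return min(self.cwin, flow_control_window)


def trace(rtts, thresh_mss=8):
    """Window size in MSS units at the start of each RTT, with no losses."""
    w = RenoWindow(initial_thresh=thresh_mss * MSS)
    sizes = []
    for _ in range(rtts):
        sizes.append(round(w.cwin / MSS, 2))
        for _ in range(int(w.cwin // MSS)):   # one ACK per full segment
            w.on_ack(MSS)
    return sizes


print(trace(8))
```

The trace doubles until it reaches the threshold and then grows by a little under one MSS per RTT.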
RTT estimate
- RFC 1122, section 4.2.3.1
- RTT estimate must be accurate, or TCP will incorrectly assume
that the network is congested
- Karn/Partridge algorithm: don't use retransmitted segments
for RTT estimation.
- for an accurate RTT estimate, keep a running average of RTTs:
RTTaverage_x = (1 - alpha) RTTaverage_{x-1} + alpha RTT_x
- For example, alpha = 0.125 (1/8).
- Could set the timeout to 2 * RTTaverage_x.
- but also keep track of the variance in RTT.
Jacobson/Karels algorithm
- receive an ack with round-trip-time RTT_x
- New estimate: RTTaverage_x = (1 - alpha) RTTaverage_{x-1} + alpha RTT_x
- Deviation average:
DevAve_x = (1 - beta) DevAve_{x-1} + beta |RTT_x - RTTaverage_{x-1}|
- Timeout: Timeout_x = u RTTaverage_x + phi DevAve_x
- Parameters:
0 < alpha, beta < 1 (typically, alpha = 1/8 for RTTaverage, and beta = 1/4 for DevAve)
u = 1
phi = 4
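A minimal sketch of the estimator described above. The class and attribute names are illustrative. The starting values (average set to the first sample, deviation set to half of it) are an assumption taken from RFC 6298, not from these notes.

```python
class RttEstimator:
    def __init__(self, first_rtt, alpha=1 / 8, beta=1 / 4, u=1, phi=4):
        self.alpha, self.beta, self.u, self.phi = alpha, beta, u, phi
        self.rtt_average = first_rtt
        self.dev_average = first_rtt / 2   # assumed initial value (RFC 6298)

    def timeout(self):
        return self.u * self.rtt_average + self.phi * self.dev_average

    def update(self, rtt):
        """Fold in one RTT sample and return the new timeout.

        Per Karn/Partridge, callers should not pass samples measured
        on retransmitted segments.
        """
        # The deviation uses the previous average, so update it first.
        self.dev_average = ((1 - self.beta) * self.dev_average
                            + self.beta * abs(rtt - self.rtt_average))
        self.rtt_average = ((1 - self.alpha) * self.rtt_average
                            + self.alpha * rtt)
        return self.timeout()


est = RttEstimator(first_rtt=1.0)
for sample in (1.0, 2.0, 1.0, 1.0):
    print(sample, est.update(sample))
```

A single slow sample raises the timeout mostly through the deviation term, which is the point of tracking the variance.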
Congestion Control
- TCP Vegas
- other ways of detecting congestion
- addressing congestion
- router intervention
- Internet Explicit Congestion Notification
TCP Vegas
- Reno detects congestion after it happens
- Reno also causes congestion by increasing the window until
congestion occurs
- early congestion detection:
as queues get filled in the router, packets take longer, so the RTT increases
- when RTT gets bigger, we can slow our sending
- when RTT gets back to minimum, we can increase our sending
- not standard, but tested to work well
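A rough sketch of delay-based adjustment in the spirit of Vegas. The estimate of queued segments from the minimum and current RTT follows the usual description of Vegas, but the function name and the two thresholds here are assumptions chosen for illustration.

```python
def vegas_adjust(cwin, base_rtt, current_rtt, low=2.0, high=4.0):
    """Return the new window, in segments, after one RTT.

    base_rtt is the smallest RTT seen so far (queues assumed empty).
    """
    expected = cwin / base_rtt       # rate if nothing were queued
    actual = cwin / current_rtt      # rate actually achieved
    queued = (expected - actual) * base_rtt   # our segments sitting in queues
    if queued < low:
        return cwin + 1              # RTT near minimum: send more
    if queued > high:
        return max(1, cwin - 1)      # RTT growing: send less
    return cwin


print(vegas_adjust(10, base_rtt=0.100, current_rtt=0.100))
print(vegas_adjust(10, base_rtt=0.100, current_rtt=0.200))
```

Both the increase and the decrease are by one segment, so the window changes additively in each direction.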
Detecting and Addressing Congestion
- detecting congestion:
- queues get longer
- RTT gets bigger
- data / RTT (power) starts to drop as you try to send more
- addressing congestion:
- additive increase/multiplicative decrease (needed for stability if
congestion is occurring)
- additive increase/additive decrease (TCP Vegas) -- works as long
as congestion can be avoided
- setting flow rate
- bandwidth reservation
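A toy illustration of why additive increase/multiplicative decrease is stable: two flows that start with very unequal rates drift toward equal shares. The capacity, starting rates, and step sizes are made-up numbers, and both flows are assumed to see congestion at the same time.

```python
def aimd(rates, capacity, rounds, increase=1.0, decrease=0.5):
    """Apply AIMD to all flows for the given number of rounds."""
    rates = list(rates)
    for _ in range(rounds):
        if sum(rates) > capacity:
            rates = [r * decrease for r in rates]   # congestion: everyone halves
        else:
            rates = [r + increase for r in rates]   # no congestion: everyone adds 1
    return rates


start = [1.0, 50.0]
final = aimd(start, capacity=60.0, rounds=200)
print(start, "->", final)
```

Each multiplicative decrease halves the gap between the two flows, while each additive increase leaves it unchanged, so the gap shrinks over time.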
Router Intervention to Avoid or React to Congestion
- Random Early Discard -- causes TCP to back off
- information feed-forward -- the receiver must then return
congestion information to the sender (see Internet ECN, below)
- information feedback -- requires route back to sender, does not
work in Internet (except source quench ICMP, which is deprecated)
- communication time from router to sender may be insufficient if
sender is sending lots of stuff. Also, stability issues -- all senders
could increase their sending rate at the same time
- credits: can only send as much as we have in the "bank", automatically
(but not immediately) replenished
Internet Explicit Congestion Notification
- ECN, explicit congestion notification,
RFC 3168.
- in ECN, two of the bits of the Type of Service (ToS) field are
used to indicate (a) whether congestion notification is requested (ECT),
and (b) whether the packet experienced congestion (CE).
- TCP uses two new bits: ECE (ECN-Echo, to report that a packet
was received with the CE bit set -- bit before URG),
and CWR (Congestion Window Reduced, bit before ECE),
to indicate that the ECE bit was received.
- compatible with hosts and routers that don't do ECN
- typical usage of ECN:
- senders can set ECT
- routers can change ECT to CE to record that congestion
was experienced, perhaps instead of dropping a packet
- transport layer is informed of CE, sends an ECE
- receiver of ECE reduces congestion window, sends CWR
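The typical usage above can be sketched as three small pieces: the router, the receiver, and the sender. The codepoint values are those of RFC 3168 (which defines two ECT codepoints); the class and function names are hypothetical, and the sender is simplified to react to every ECE rather than once per window.

```python
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11   # RFC 3168 codepoints


def router_forward(ecn_bits, congested):
    """Return the ECN bits to forward, or None if the packet is dropped."""
    if not congested:
        return ecn_bits
    if ecn_bits in (ECT_0, ECT_1, CE):
        return CE        # mark the packet instead of dropping it
    return None          # sender does not do ECN: drop as before


class EcnReceiver:
    def __init__(self):
        self.echo = False

    def on_data(self, ecn_bits, cwr_flag):
        """Return the ECE flag to set in the ACK for this packet."""
        if cwr_flag:
            self.echo = False    # sender has reacted: stop echoing
        if ecn_bits == CE:
            self.echo = True     # keep echoing until CWR arrives
        return self.echo


class EcnSender:
    def __init__(self, cwin):
        self.cwin = cwin         # congestion window, in segments

    def on_ack(self, ece_flag):
        """Return True if the next data packet should carry CWR."""
        if ece_flag:
            self.cwin = max(1, self.cwin // 2)
            return True
        return False


sender, receiver = EcnSender(cwin=10), EcnReceiver()
marked = router_forward(ECT_0, congested=True)
ece = receiver.on_data(marked, cwr_flag=False)
cwr = sender.on_ack(ece)
print(marked == CE, ece, cwr, sender.cwin)
```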