ICS 699 Project Proposal

ICS 699 Project Report By Xiaochun Hu, Zhifeng Jia Instructor: Edo S. Biagioni

A Reliable Transport Protocol over ATM

Overview

The purpose of this project is to design and develop an ATM Transport Protocol ( ATP ) to carry data traffic reliably and efficiently. We plan to use the ATM API provided by Fore System, and implement a transport protocol layer above ATM AAL5.

The specific design of ATP will include a fragmentation and reassembly mechanism, a retransmission method and a source based congestion avoidance and control algorithm.

Since ATM is a connection oriented, best effort network protocol, it only provides unreliable data transmission. Fore System ATM API provides ATM Adaptation Layer that would carry data segments of underlying network MTU size. It also provides Quality of Service in the connection establishment stage, such as guaranteed minimal bandwidth and possible peak bandwidth. But the data transmission is still best effort. Our ATP design will provide reliable features on top of ATM AAL5. By introducing the packet fragmentation and reassemble function, ATP will carry data packets of arbitrary length, eliminating the limits of the MTU size. By retransmission of the dropped data packets, ATP will achieve reliable data traffic. Finally, by a congestion avoidance and control mechanism, ATP will enhance the performance and contribute to better network throughput.

In general, ATP will be designed to utilize and accommodate properties of ATM such as connections, high-reliability, order preservation guarantees and available bandwidth reservation. Many of these properties are similar to corresponding features of TCP, so it would be useful to demonstrate that a specialized protocol can substantially outperform the current practice of carrying TCP over the connectionless, unreliable and non-sequence-processing IP.

Significance

Asynchronous Transfer Model is a network technology that provides features substantially different from those of other common, packet-based network technologies. There are many significant features of ATM networks. ATM is connection-oriented, highly reliable in preventing bit errors and out of order delivery, and able to provide quality of service. All of these features make it possible to design and develop a new protocol simpler than the TCP/IP stack to fulfill the requirements of TCP/IP.

The proposed ATP design and development will advance the state of art in the field of networking by replacing a popular but complex protocol with a simpler specialized protocol. In addition, this project will evaluate on a cell-based network the performance of flow-control algorithms developed for TCP, as well as make use of the resource reservation mechanism provided by ATM.

More generally, the successful development of ATP will demonstrate the benefits of developing specialized protocols to implement more general protocols for specific types of networks, and perhaps motivate and encourage the development of other such protocols.

Problem Statement

Many distinguished scholars have carefully studied the cost of executing TCP/IP. Their conclusion is that the major costs of running a conventional TCP/IP implementation are data copy, operating system overhead, and checksum computations. They also argue that most of these costs are inherent in any transport protocol with the same functionality. Such costs affect end-users and applications by limiting the achievable throughput and also by consuming CPU cycles that could otherwise be used by user programs.

The purpose of ATP is to reduce or eliminate as many as possible of these costs:

The cost of checksum computations can be eliminated entirely by relying on the low error rate of ATM and on the CRC provided by the ATM Adaptation Layer.
The cost of protocol processing can be substantially reduced by implementing a simpler protocol designed to run on top of ATM. For example, ATP requires smaller packet headers than TCP/IP and has fewer operations to perform on each packet.
The cost of lost packet recovery and retransmission can be substantially reduced by making it less likely that a packet will be lost due to congestion
The costs of searching for the right sending rate can be reduced by a quick estimate of available bandwidth by using the ATM resource reservation.

Objectives

The objectives of this project are to design and implement a specialized protocol, ATP, to carry TCP traffic efficiently over ATM. A further objective is to evaluate the effectiveness of the algorithm chosen in the project.

Protocol Design

Completion of the protocol design requires selecting, adapting, or designing several algorithms, especially a retransmission algorithm, a start up algorithm, a congestion avoidance and control algorithm, and an encoding for information that must be carried to implement those algorithms and to conduct fragmentation and reassembly.

Retransmission

The retransmission algorithm is very simple. Since the underlying ATM network will either deliver the correct data segment or drop the segment and there is no bit corruption, duplication or reordering of the segments, an acknowledgement based scheme seems best. If the expected segment is received at the receiver side, an ACK will be sent back to the sender. Or if the receiver observe the gap of sequence number from the adjunct segments, an NAK of the data fragments with the smallest sequence number that haven’t arrived yet will be sent back to indicate the lost of segment. ACK is accumulative and NAK is not. Acknowledgement packets are all 4-bytes long. The first 30 bits stand for the corresponding sequence number that this acknowledgement packet is for, and the 31^st bit is the sign of ACK or NAK. If ACK, the 31^st bit is cleared (0), otherwise, the 31^st bit is set indicating it is a NAK. The 32^nd bit(LSB) is always set for acknowledgement packets. The following graph shows the ACK/NAK packets format. (big-endian)

On the sender side, if an ACK is received, the sender just processes the normal packet handling and congestion control. If a NAK is received, the sender will start retransmission of the data fragments with sequence number carried by NAK right away. So NAK is a trigger for fast retransmission.

Fragmentation and Reassembly

First of all, the fragmentation and reassembly module will fragment the data packets by the underlying MTU size of the network. Also each data segment requires a special header to carry information such as the sequence number and the last-segment identification bit. We have decided to use a 4-byte header for each data fragment, so the actual payload of the data fragments is MTU - HEADER_SIZE. In this 32-bit header, the first 30 bits stand for the sequence number of the data fragment, the 31^st bit indicates whether this fragment is the last fragment for the complete message, and the 32^nd bit (LSB) for data fragments is always cleared to 0. The following graph shows the header format.

The header of data fragments is similar to acknowledgement packets. Both are 32 bits long and have same meaning for the first 30 bits. The advantage of this design is that when the receiving thread receives a packet from the network, by reading the first 4 bytes, it could tell if it receives a data packet or an acknowledgement packet. Data fragments have the 32^nd bit cleared, and acknowledgement packets have the 32^nd bit set. In summary, the receiving thread could distinguish these 4 kinds of packets by the following table.

Packet	31^st bit	32^nd bit	Low order 2 bits
Normal data packet	0	0	0
Last data packet	1	0	2
ACK	0	1	1
NAK	1	1	3

The sequence number of the data fragments is part of the state kept for each connection. The connection will also keep track of the next sequence number that is available for the data fragments of the next message, so messages will not be affected by the delayed ACKs or data fragments from the last Sending or Receiving call in the same connection. And as there are 30 bits used for sequence numbers, if the next available sequence number is larger than 2 ^ 29, the global sequence number will be reset to 0. So there is an assumption in the protocol that, no messages from the application would have more than 2 ^ 29 data fragments of MTU – HEADER_SIZE size.

Congestion Control

The congestion avoidance and control algorithm will be along the lines of Jain's CARD (Congestion Avoidance using Round-trip Delay) approach. It is a window based, source based, implicit information feedback approach. Each time a data segment is successfully delivered, the congestion window size will be adjusted according to both the changes in the RTT and previous window size. In other words, every time an ACK is received, the congestion control algorithm will be triggered to adjust the congestion window size. The following pseudo code will give a clear description of this algorithm.

Given round-trip delays D and Dold at windows W and Wold respectively:

NDG = ( D - Dold ) / ( D + Dold ) * ( W + Wold ) / ( W - Wold );

IF ( NDG > 0 or W = Max WindowSize )

THEN Decrease ( W );

ELSE IF ( NDG <= 0 or W = Min WindowSize )

THEN Increase ( W );

Min WindowSize and Max WindowSize are lower and upper bound on the window. The upper bound is set equal to the flow control window permitted by the receiving node based on its local buffer availability considerations. In our project, we assume that the receiver has unlimited buffer space to hold the incoming data fragments, so there is no upper bound for the window size. And because ATM has guaranteed minimal bandwidth reservation, the lower bound is calculated by the following formula.

Min WindowSize = ( Minimal Bandwidth guarantee) * (Initial estimated RTT) / (MTU)

By setting the upper and lower bounds both equal to Max WindowSize ( in our case, which is unlimited ), we can disable the window adjustment and congestion control algorithm.

Previous research results prove that CARD approach converges to optimal level independently of the window values used at connection initialization.

Start Up Algorithm

The start up algorithm is used on a connection to determine the initial window size. The well-designed algorithm should be able to choose the window size that would both lead to best utilization of the bandwidth and avoid congestion and segments loss. Because ATM has the ability to provide Quality of Service information at the connection establishment stage, we could make use of that information to jump to the best assuming window size.

On the active side of connection establishment, the Initial estimated RTT is obtained by calculating the time required to the connection establishment. On the passive side of a connection establishement, the Initial estimated RTT is set to a constant. In our project, it is set to 1 second. The sender can initially set the WindowSize = c, while c is a constant, and start to send out the normal data segments right after the connection establishment. When the ACK for the first data fragment comes back, the sender can reset the WindowSize by using the above formula. We will first set constant c equal to 1. The selection of a reasonable value for c is left to a later phase of this project. This start up algorithm will let the data traffic jump to the optimal stage very quickly. Since the sender could start sending data segments right after the connection establishment, this algorithm will not introduce any delay compared with slow-start algorithm.

Quality of Service

The ATP protocol will guarantee some Quality of Service. During the connection, the actual sending rate will be calculated each time when an ACK is received.

Actual Sending Rate = ( Current WindowSize ) * ( MTU ) / ( Current RTT )

If the Actual Sending Rate is lower than the Minimal Bandwidth for the connection, the ATP protocol will reset the window size to Min WindowSize. If the Actual Sending Rate is higher than the Peak Bandwidth for the connection, ATP will continue using CARD algorithm to exploit the extra bandwidth available along the connection.

Implementation

ATP is implemented and realized on Windows NT system, by using the Winsock API and Fore System ATM API. The full implementation will result in designing several library functions above the Winsock API and Fore System ATM API libraries. These functions will provide reliable and efficient data transmission services with congestion avoidance considerations.

Currently, the ATP protocol is implemented with the base classes from Microsoft Foundation Classes, and it is running above TCP.

The main implementation involves a ATPSocket class, including many member variables that support the protocol design. Also the ATPSocket class provides a set of methods for the applications to reliably establish the network connection and deliver messages of arbitrary length. The methods provided are:

ATPSocket(), constructor of the ATPSocket object, will return an ATPSocket object handle for the application to use the methods provided by this class to deliver messages. The constructor also gives initial values to all the class variables, the ASTSendThread and ARTRecvThread are set to NULL here.
Bind(), for passive connections, to bind an ATPSocket to a Service Access Point, basically the same as the Unix socket call.
Listen(), for passive connections, detects the ATPSocket listen to the Service Access Point that might have incoming connections from the other end.
Accept(), for passive connections, detects an incoming connection, then creates a new ATPSocket to receive messages or sends messages to the other end. This call also creates sending and receiving threads for the new ATPSocket that is ready to handle the message deliveries, and activates both of them. If nothing on sendList, sending thread blocks, and the receiving thread is always running and waiting for something to arrive. The sending and receiving threads would both suspended by the ATPSocket Close() call, and deleted when ATPSocket is deleted.

Also at socket creation we get the MTU size of underlying network and set the member variable of new ATPSocket to this value. From the incoming connection, we obtain the QoS for the data traffic from this point to the other end, ( this is allocated by active connection partner ), and give this value to Minimal Bandwidth guaranteed. With the constant of Initial estimated RTT , we calculate the Min WindowSize for the congestion control mechanism.*

Connect(), in active connections, connects to the peer. And creates sending and receiving threads and activates both of them. Sending thread is waiting for the Send() call to put the sendItem object on sendList, and the receiving thread is waiting for anything from the other end.

Also we get the MTU size of the underlying network and set the member variable of the ATPSocket to this value. We let the application define the QoS for data traffic of both directions, and we allocate the minimal bandwidth for both directions. If the underlying network could not allocate what the application wants, we return SOCKET_ERROR. If we succeed in allocating resources, set the guaranteed QoS to Minimal Bandwidth guaranteed. And set the Initial estimated RTT to the time used for establishing the connection, this allows us to calculate the Min WindowSize for the congestion control mechanism.*

Send(), packs the application message into the sendItem object, fragmented and ready to be sent, then put this object on sendList and waits for the sending thread to finish sending it and remove it from the list.
Recv(), creates a recvItem object to wait for the incoming data fragments. After detecting the last fragments, reassembles fragments and puts the object on recvList. The Recv() call then removes it and delivers the whole message to application.
Close(), if the sending and receiving threads are created and activated, suspends both of them, and closes the ATPSocket.

All the methods are analogous to the Unix socket specifications, that is to say, they have the same meaning as the corresponding Unix socket calls and take basically the same set of parameters and return types. (* not implemented )

Each ATPSocket object has two threads that are very important for the implementation.

ASTSendThread, This sending thread is created and activated either by Connect() (active connections) or by Accept() (passive connections). It will continuously send out the data fragments of the sendItem objects on the sendList, suspend if nothing on sendList. And if the valid window is closed, suspend to wait it to be reopened.
ARTRecvThread, This receiving thread is created and activated either by Connect() (active connections) or by Accept() (passive connections). It will wait for both acknowledgement packets and data packets, and dispatch data packets and acknowledgement packets to the different Send() and Recv() calls.

For acknowledgement packets, if ACK is received, the congestion control method will be triggered to update the congestion window size, and the valid window for the current sendItem object might change either because of congestion window change or because of a window moving forward. And if NAK is received, the retransmission of the corresponding data fragments is triggered.

For data packets, an ACK or NAK will be sent back to indicate receipt of fragments, or expectation of the previous fragments. And if one message is complete, put it as a recvItem object on the recvList, and let the Recv() call remove it from recvList and deliver it to application.

In short, the ATP protocol has two parts, the upper part consists of all the ATPSocket calls, including connection establishment, tear down, blocking Send and Recv calls; and the lower part with sending and receiving threads, which continuously handle the data/acknowledgement exchange. The upper part and the lower part cooperate in implementing a reliable transmission of arbitrary length message with congestion control.

Assumptions and Specifications of the Protocol

The following assumptions and specifications are made on this ATP protocol:

The header of the data fragments are 4bytes long: the first 30 bits are for sequence number, the 31^st bit is set for "last fragment" indication, and the 32^nd bit is always cleared.
The acknowledgement packets are 4 bytes long: the first 30 bits are for sequence number, the 31^st bit set for NAK and clear for ACK, and the 32^nd bit is always set.
The Global sequence number ranges from 0 ~ 2 ^ 30. When the next global sequence number is greater than 2 ^ 29, it is reset to 0 on the next message boundary.
The network connection is bi-directional
Multiple ATP Send calls would be handled sequentially, with no interleaving of data fragments from different messages
Congestion control parameters are kept and updated during the process of connection, and will serve for multiple ATP Send calls.
Cumulative ACK will trigger the congestion control mechanism, and non-cumulative NAK will trigger fast retransmission of the NAKed data fragment
Blocking Send and Recv call. It is the application’s responsibility to make time out if necessary.
No message from the application would exceed the length limitation of 2 ^ 29 * ( MTU – HEADER_SIZE )
The receiver has unlimited buffer space, so there is no upper bound for Max WindowSize

Project Status

Because we encountered unforeseen difficulties with the FORE ATM adapter card, the ATP protocol now is implemented above TCP connection. So we did not implement the Quality of Service part of the protocol, which is only belong to ATM connection.

Currently, the ATP protocol only supports blocking Send and Recv calls, so Send and Recv will only return if a complete and correct message is successfully delivered by the underlying network. Further work would involve implementation of timeouts and notification of the caller about the status of the message delivery, which would avoid having to program them on the application level.

Also protocol objects are implemented above Microsoft Foundation Classes. For instance, sending and receiving threads are derived from CWinThread, and sendFragments and receiveFragments object classes using many parameters of CString objects. So the efficiency issue is rarely addressed in the protocol. If there would be further testing and evaluation conducted on this ATP protocol, source code need to be optimized.

References

Raj Jain; Digital Equipment Corporation; "A Delay-Based Approach for Congestion Avoidance in Interconnected Heterogeneous Computer Networks"; Littleton, MA

Zheng Wang, Jon Crowcroft; Department of Computer Science, University College London; "Eliminating Periodic Packet Losses in the 4.3-Tahoe BSD TCP Congestion Control Algorithm"; London, UK

Lawrence S. Brakmo, Sean W. O'Malley, Larry L. Peterson; Department of Computer Science, University of Arizona; "TCP Vegas: New Techniques for Congestion Detection and Avoidance"; Tucson, AZ

Zheng Wang, Jon Crowcroft; Department of Computer Science, University College London; "A New Congestion Control Scheme: Slow Start and Search ( Tri-S )"; London, UK

Lawrence G. Roberts; ATM Systems Division, Connectware Inc.; "Second Generation Flow Control - Explicit Rate Traffic Management for ATM";

Nasir Ghani, Jon W. Mark; Department of Electrical and Computer Engineering, University of Waterloo; "Rate-Based Control Algorithm for Available Bit Rate ( ABR ) Service in ATM networks"; Waterloo, Ontario

"Performance Evaluation of Congestion Control Mechanisms in ATM Networks"