Computer Networks Notes 1: Overall Networking Concepts and Programming


1. Goals of Computer Networking

The main purpose of computer networking is to allow distributed programs to communicate.

This is a sweeping definition, covering local- and wide-area networks (LANs and WANs) and modems, the world-wide web and Internet telephony, wireless and fiber optics. It excludes older communications technologies such as traditional telephony and the broadcast media, which were not originally designed for computer-to-computer communication.

What separates computer communication from the other technologies is the need to exchange bytes and collections of bytes among programs. This imposes a number of requirements:

  1. It must be possible for a program to indicate where the bytes must go, by specifying a destination address, which uniquely identifies the receiver.
  2. It must be possible for a computer to send the bytes to at least one other computer, which must be "closer" (by some definition of "closer") to the destination. Communicating with a directly connected computer is done at the data link layer. Forwarding the packet to the final destination is the job of the network layer.
  3. In some applications, if bytes are not correctly received at the destination, the sender should send them again and again, until the destination acknowledges their receipt. This is a job for the transport layer, which can also slow down the sender if the network or the receiver is overwhelmed. The transport layer may also decide which of many programs on a given destination machine needs to receive the bytes.
Some examples of what can happen when these requirements are not met:

Organization of these Notes

2. Applications

A network, even one that successfully supports the exchange of data and therefore satisfies the above goals, is only useful if it supports actual applications.

The applications themselves may use different protocols. These application-specific protocols belong in the application layer. Application layer protocols depend on the lower-layer protocol to transfer the data, and add services such as security, file sharing, directory maintenance, email, web access, network debugging, and much more. These protocols use the transport, network, and data link layer protocols to actually transfer the data.

Two or more programs that are communicating are called peers. The word peer implies equality, and in most cases communication between peers is bidirectional and in some sense equivalent.

One common model of computer communication is highly asymmetric: one peer, the client, sends requests (encoded as bytes) to another peer, the server, which sends replies. In the client-server model, requests and replies are matched, with one reply for every request. The client chooses the server, sends the request, and waits for the reply. The server has an infinite loop which accepts requests, does whatever work is necessary, then sends the appropriate reply back to the client. Each server can serve many clients, and each client can access many servers.

One example of the client-server model is the world-wide web. The client, usually a web browser or a web robot, sends a request to a web server using a protocol called HTTP. This protocol allows the client to specify which file from the remote system is desired. The server receives the request, checks that the desired file is available, and if so sends back the contents, again using HTTP. If the file is not available, HTTP provides mechanisms for the server to communicate this to the client.

Other client-server applications include:

All of these applications include servers that provide services (files, processing, storage of conversation) and clients that request services. The details of each interaction are very different -- some may not even match the client-server model except to the extent of calling one program the server and one program the client. For these reasons, standard application programming interfaces (APIs) do not directly support the client-server model. For the same reason, we do not spend much time studying specific application-level protocols, though each one can be very important to a large community of users. HTTP is the only application-layer protocol we will study in this course.

The APIs do support peer-to-peer transfers, letting the programmer implement any necessary client or server interactions. Although these APIs are available for a number of languages, what follows describes the APIs for the C language. Due to its efficiency and flexibility, and due to the large volume of programs already written in C, C is still the most commonly used language for implementing network applications, although Java, C++, and other languages keep gaining in popularity.

3. Network Programming

The API used for communication over the Internet is called the sockets API. It was originally developed at the University of California, Berkeley, for BSD Unix, and is now supported by Linux and other Unix-like systems. There is also a variant that works under Windows. The Unix sockets API includes the following functions:
  1. functions to set up and manage connections:
    int socket(int domain, int type, int protocol);
    int bind(int s, struct sockaddr *my_addr, socklen_t addrlen);
    int listen(int s, int backlog);
    int accept(int s, struct sockaddr *addr, socklen_t *addrlen);
    int connect(int s, struct sockaddr *server_address, socklen_t addrlen);
    int shutdown(int s, int how);
    int close(int s);
    
  2. functions to send data
    int send(int s, const void *msg, int len, int flags);
    int sendto(int s, const void *msg, int len, unsigned int flags,
               const struct sockaddr *to, socklen_t tolen);
    int write(int s, const void *buf, int count);
    
    On Windows, write cannot be used to send data on sockets.
  3. functions to receive data
    int recv(int s, void *buf, int len, int flags);
    int recvfrom(int s, void *buf, int len, unsigned int flags,
                 struct sockaddr *from, socklen_t *fromlen);
    int read(int s, void *buf, int count);
    
    On Windows, read cannot be used to receive data on sockets.
  4. additional functions:
    int gethostname(char *name, int len);
    struct hostent *gethostbyname(const char *name);
    struct protoent *getprotobyname(const char *name);
    
  5. functions required in the Windows Socket API (WSA), and not present in the Unix Socket API:
    int WSAStartup(int version, WSADATA *implementation);
    int WSACleanup();
    

Where to find information

Much more information is available from the manual pages for each function. These pages are usually available on the system, by typing man subject, and also at many places on the web. For Windows help, refer to the Windows documentation.

Another reference is Comer and Stevens' "Internetworking with TCP/IP -- Volume III -- Client-Server Programming and Applications". The copy I have seen was published in 1997 by Prentice Hall, ISBN 0-13-848714-6, and proclaims on the cover "Winsock Version of Client-Server Programming". However, at least one student in this course has told me this book covers the material but not in sufficient detail -- he had the Linux/Posix edition from 2000.

The following is a list of books that some of my ICS 451 students suggested, in no particular order:

I suggest you check a book carefully for usefulness at your level of expertise before making any financial commitment.

If you look at the functions bind, listen, accept, connect, shutdown, close, send, sendto, write, recv, recvfrom, and read you will see that all take as first argument an integer file descriptor such as returned by the call to socket. If C were object oriented, it is possible that this design would have been somewhat different, with an object of class socket having methods bind, listen, accept, and so on. Perhaps socket would be a subclass of another class, "file descriptor", which would provide the operations close, write, etc. In such a hypothetical object-oriented design, we might have an object s of type socket and call:

 s->send (...)  
Instead, in C, we call
 send (s, ...)
The two notations are formally equivalent, and in fact most compilers internally convert the first form to the second. This notation can generally be used to express any object-oriented concept in languages that are not object oriented.
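As an illustration of this equivalence, here is a minimal sketch of a hypothetical "class" written in plain C. The names struct sink, sink_send, and sink_init are invented for this example and are not part of the sockets API. The object carries a pointer to its method, and the method receives the object as its first argument, just as send receives the socket descriptor.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical "class": a byte sink with a single method, send. */
struct sink {
  /* the method pointer; the object itself is passed as the first argument */
  int (*send) (struct sink *self, const void *msg, size_t len);
  char data [64];     /* bytes accepted so far */
  size_t used;        /* number of valid bytes in data */
};

/* the plain C form of the method: the "object" is the first parameter */
static int sink_send (struct sink *self, const void *msg, size_t len)
{
  if (len > sizeof (self->data) - self->used) return -1;  /* would overflow */
  memcpy (self->data + self->used, msg, len);
  self->used += len;
  return (int) len;
}

/* the "constructor": install the method and mark the sink as empty */
static void sink_init (struct sink *s)
{
  s->send = sink_send;
  s->used = 0;
}
```

After sink_init (&s), the calls s.send (&s, "abc", 3) and sink_send (&s, "abc", 3) execute exactly the same code. The first form only adds a lookup through the function pointer.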

For examples of how these functions can be used in actual practice, refer to Homework 1, which uses the TCP protocol. The alternative is to use the UDP protocol. Programs that use UDP are similar to those using TCP but we specify "udp" instead of "tcp" (of course), and SOCK_DGRAM instead of SOCK_STREAM.
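The following is a rough sketch, not taken from Homework 1, of how the connection functions might fit together for TCP. The helper names make_listener, local_port, and connect_to_local are hypothetical, the address is fixed to the loopback address 127.0.0.1 to keep the example short, and the protocol argument is given as 0 (let the system choose) rather than looked up with getprotobyname.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Server side: create a TCP socket listening on 127.0.0.1.
   A port of 0 asks the system to choose any free port.
   Returns the listening descriptor, or -1 on error. */
static int make_listener (unsigned short port)
{
  struct sockaddr_in addr;
  int s = socket (AF_INET, SOCK_STREAM, 0);  /* SOCK_DGRAM would select UDP */

  if (s < 0) return -1;
  memset (&addr, 0, sizeof (addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons (port);
  addr.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
  if (bind (s, (struct sockaddr *) &addr, sizeof (addr)) < 0 ||
      listen (s, 5) < 0) {
    close (s);
    return -1;
  }
  return s;
}

/* Report which port a bound socket is using, or -1 on error. */
static int local_port (int s)
{
  struct sockaddr_in addr;
  socklen_t len = sizeof (addr);

  if (getsockname (s, (struct sockaddr *) &addr, &len) < 0) return -1;
  return ntohs (addr.sin_port);
}

/* Client side: connect to a server on 127.0.0.1 at the given port.
   Returns the connected descriptor, or -1 on error. */
static int connect_to_local (unsigned short port)
{
  struct sockaddr_in addr;
  int s = socket (AF_INET, SOCK_STREAM, 0);

  if (s < 0) return -1;
  memset (&addr, 0, sizeof (addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons (port);
  addr.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
  if (connect (s, (struct sockaddr *) &addr, sizeof (addr)) < 0) {
    close (s);
    return -1;
  }
  return s;
}
```

A server would follow make_listener with a loop that calls accept, and would then use send and recv on the descriptor that accept returns. A client uses send and recv directly on the descriptor returned by connect_to_local.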

There are many differences between the TCP and UDP protocols, but for the application programmers two of the differences are essential: TCP is stream-oriented and reliable, UDP is packet (datagram) oriented and unreliable. We describe these terms in the remainder of this section.

A packet-oriented protocol is one in which each send or write operation corresponds to exactly one read or receive (recv) operation. If the buffer given to the receive operation is too small for the packet (or datagram), the packet is truncated to fit, and the bytes that do not fit are discarded.
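The truncation behavior can be observed without a network. The sketch below is an illustration only: it assumes a Unix-like system and uses a local (AF_UNIX) datagram socket pair in place of UDP, and the function name datagram_demo is hypothetical. It sends two datagrams and receives each through a buffer that may be too small.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send the two datagrams "hello world" and "abc" on a local datagram
   socket pair, then receive each one into a buffer of buf_len bytes.
   first and second must each have room for buf_len bytes.
   Returns 0 on success, -1 on error. */
static int datagram_demo (size_t buf_len,
                          char *first, ssize_t *first_len,
                          char *second, ssize_t *second_len)
{
  const char *one = "hello world", *two = "abc";
  int pair [2];
  int result = -1;

  if (socketpair (AF_UNIX, SOCK_DGRAM, 0, pair) < 0) return -1;
  if (send (pair [0], one, strlen (one), 0) == (ssize_t) strlen (one) &&
      send (pair [0], two, strlen (two), 0) == (ssize_t) strlen (two)) {
    /* each recv returns one datagram; bytes that do not fit are discarded */
    *first_len = recv (pair [1], first, buf_len, 0);
    *second_len = recv (pair [1], second, buf_len, 0);
    if (*first_len >= 0 && *second_len >= 0) result = 0;
  }
  close (pair [0]);
  close (pair [1]);
  return result;
}
```

With buf_len set to 5, the first receive delivers only "hello", and the second receive delivers "abc" rather than the remainder of the first datagram: the bytes " world" are lost.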

In contrast, a stream-oriented protocol is one in which the bytes are treated as being in sequence, with no boundaries introduced by the send operation. For example, a sequence of send operations may be combined and all the data received in a single recv operation. Conversely, the data sent using a single send operation may be received by multiple read operations. The system is free to combine or partition the stream at will, as long as the bytes are delivered in the correct order, and the application has no control over how many bytes are received each time.

With a reliable protocol, either all the data sent is correctly received, or the sender is informed that at least some of the data was not delivered. With an unreliable protocol, some of the data may be lost. With an unreliable datagram protocol, datagrams might even be delivered out of order.

For most applications we use TCP. The reliability is exactly what is needed for most systems. The stream-oriented nature generally means we have to check whether we've received all the data we expected, and loop back and read again if we haven't -- this is a potential pitfall that programmers need to be aware of. The program may test fine on selected inputs, and fail unexpectedly and surprisingly on a larger range of inputs. For all but the most trivial applications, always have recv inside a loop, and continue looping until all the expected data has been received.
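One possible shape for such a loop is sketched below. The function name recv_all is hypothetical, and the sketch assumes the caller knows in advance how many bytes to expect (len greater than zero).

```c
#include <sys/socket.h>
#include <sys/types.h>

/* Receive exactly len bytes from stream socket s into buf.
   Returns len on success, 0 if the peer closed the connection before
   all the data arrived, or -1 on error. */
static ssize_t recv_all (int s, void *buf, size_t len)
{
  char *p = buf;
  size_t received = 0;

  while (received < len) {
    /* ask only for the bytes still missing; recv may return fewer */
    ssize_t n = recv (s, p + received, len - received, 0);

    if (n < 0) return -1;     /* error; errno was set by recv */
    if (n == 0) return 0;     /* connection closed early */
    received += (size_t) n;
  }
  return (ssize_t) len;
}
```

A program that calls recv only once may work whenever the data happens to arrive in a single piece, which is exactly the situation in which testing on selected inputs succeeds and later use fails.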

UDP is only used by specific, advanced applications for which the stream model or the reliability of TCP are inappropriate. For example, in a real-time situation such as Internet Telephony it is better to lose the occasional packet than for TCP to slow down while lost data is being retransmitted.

4. Network Protocol Implementations

Networking protocols are usually implemented as part of the operating system. The sockets API is essentially used to transfer control to the operating system to perform the required functions, for example, send data. Within this general model are two possible implementations.

One way to implement the API is to provide the functions of the API directly as system calls. A system call executes a special trap instruction, which transfers control to the operating system (referred to as the OS or the kernel). The OS examines the parameters of the trap and decides which internal function to execute, then uses another special instruction to return to the caller. Functions which are internal to the OS but can be called by application-level programs are therefore system calls. System calls are documented in Chapter 2 of the Unix manual, so the man page for bind can be read by typing man 2 bind. Linux implements the sockets API as system calls.

Another possible implementation is as library functions. A library is simply a collection of related application-level code that can be called by application programs (in Java, a library is called an "archive"). The library functions need some way to send and receive the data, perhaps via other system calls, but as long as the OS provides such mechanisms, the OS itself need not implement the specific sockets API (it still does need to implement the underlying protocols -- contact your instructor if this is confusing to you). Unlike system calls, library calls are not automatically available to programs but have to be explicitly linked in. The linking is the final stage of the compilation process in which the executable file is created. On systems where the sockets API is available as a library, the final compilation step must include -lsocket to get the socket functions, and -lnsl to get the name server functions. C library calls are documented in Chapter 3C of the Unix manual, so the man page for bind can be read by typing man -s 3c bind. Solaris/SunOS implements the sockets API as library functions.

Whichever way the sockets API is implemented, the user programs themselves are the same, and the only noticeable difference is in the compilation -- which libraries, if any, must be linked in to get the program to work.

Almost every non-trivial implementation of a networking system is multithreaded. To see why this is fundamental, consider that the networking code must be able to react to packets coming from two different sources: one is the network (data that the system receives), and the other is the user code (data that this system wishes to send). One relatively simple way of reacting to data arriving from two different sources (i.e. the network and the application) is to have one thread handle data from one source, and the other thread handle data from the other source. Typically, the part of the program that handles data from the user program is considered the top half (also called upper half), and the part of the code that handles data from the network (or other devices) is the bottom half (lower half). The top half runs when a user program does a system call, and performs any operations needed to get the data to its destination. The bottom half runs whenever the device signals that data is available. The two halves may have to communicate. For example, the receive system call (i.e. the part of the top half that takes data received from the network and gives it to the application) has to check whether any data has already been received from the network, and if not, must block until the data arrives. When the data arrives, the bottom half must somehow make the data available to the top half and unblock the top half so the receive system call can complete.
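The communication between the two halves can be sketched with a mutex and a condition variable. This is only an illustration using user-level POSIX threads, not how any particular kernel does it, and the names struct mailbox, mailbox_deliver, and mailbox_receive are hypothetical. The mailbox holds a single byte, so an unread byte is overwritten if another arrives first.

```c
#include <pthread.h>
#include <stddef.h>

/* One-byte mailbox shared by the two halves. */
struct mailbox {
  pthread_mutex_t lock;
  pthread_cond_t arrived;   /* signaled when a byte is deposited */
  int full;                 /* 1 if data holds a byte not yet taken */
  char data;
};

static void mailbox_init (struct mailbox *m)
{
  pthread_mutex_init (&m->lock, NULL);
  pthread_cond_init (&m->arrived, NULL);
  m->full = 0;
}

/* bottom half: called when a byte arrives from the device */
static void mailbox_deliver (struct mailbox *m, char c)
{
  pthread_mutex_lock (&m->lock);
  m->data = c;              /* an unread earlier byte is overwritten */
  m->full = 1;
  pthread_cond_signal (&m->arrived);    /* unblock a waiting top half */
  pthread_mutex_unlock (&m->lock);
}

/* top half: the receive call, which blocks until a byte is available */
static char mailbox_receive (struct mailbox *m)
{
  char c;

  pthread_mutex_lock (&m->lock);
  while (! m->full) {
    /* releases the lock while waiting, reacquires it before returning */
    pthread_cond_wait (&m->arrived, &m->lock);
  }
  c = m->data;
  m->full = 0;
  pthread_mutex_unlock (&m->lock);
  return c;
}
```

If data has already been delivered, mailbox_receive returns at once. Otherwise it blocks until the thread running mailbox_deliver signals the condition variable.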

While two threads is generally the minimum, some networking systems have many more than two threads.

A crucial task of the networking system is controlling the network device (or devices). Network devices can be quite complex, and often perform many operations automatically once they have been set up to do so. In most computers, a device is able to interrupt the main processor when it is ready for the next task. Once the processor is interrupted, it begins to execute the bottom half for that device. Code that interacts with a hardware device is called the device driver. The specific portion of the device driver that is responsible for responding to an interrupt is called the interrupt handler.

5. A Simple Network

In these notes, we build a network from the ground up. To do so, we start with simple and understandable hardware: the computer serial line. Most computers have one or more serial ports. The USB interface found on many of the newer computers is simply a more advanced version of a serial port. A serial port can be connected via a serial line to another computer or to a hardware device. The serial port and serial line can communicate one data bit at a time in each direction, hence the name. The hardware is designed to take sequences of 8 bits and give them to the computer as a byte. Most serial ports can be configured to communicate at different speeds. In general, longer wires will only work with lower speeds, and shorter wires can support higher speeds. For two computers to communicate across the serial line, they have to configure their serial ports to run at the same speed.

The following program, tty.c, can be used to read and write data across a serial line. This is a user-level Unix program. There are some system calls (especially write and read) that request that the operating system transfer data to or from the serial line. The system calls also set the line speed to 9600 baud and request that the line be set into raw mode, so the operating system should not do any special processing for characters that are meaningful on terminals. An example of processing that we do not want to see happening is erasing received characters when the backspace character or the delete character are received. Since we will be sending binary data, and backspace and delete both have binary encodings, if we allowed the operating system to do terminal processing we would lose bytes whenever we happened to transmit those particular bit sequences.

/* tty.c: write to and read from a serial port (ttyS0) */

#include <stdio.h>
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

/* sizeof (MESSAGE) counts the terminating null, which is sent as well */
#define MESSAGE "hello world\r\n"
#define BUFSIZE 1000

int main (void)
{
  char buf [BUFSIZE];
  int i, fd = open("/dev/ttyS0", O_RDWR);
  struct termios tio;

  printf ("fd = %d\n", fd);
  /* open returns -1 on error; 0 is a valid file descriptor */
  if (fd < 0) { perror ("open"); return 1; }

  i = tcgetattr (fd, &tio);
  if (i < 0) { perror ("tcgetattr"); return 1; }
  /* speeds are reported as termios codes such as B9600, not as baud rates */
  if (cfgetispeed (&tio) != B9600) {
    printf ("ispeed was code %lu, != 9600 baud (code %lu)\n",
            (unsigned long) cfgetispeed (&tio), (unsigned long) B9600);
    cfsetispeed (&tio, B9600);
  }
  if (cfgetospeed (&tio) != B9600) {
    printf ("ospeed was code %lu, != 9600 baud (code %lu)\n",
            (unsigned long) cfgetospeed (&tio), (unsigned long) B9600);
    cfsetospeed (&tio, B9600);
  }
  cfmakeraw (&tio);
  i = tcsetattr (fd, TCSANOW, &tio);
  if (i < 0) { perror ("tcsetattr"); return 1; }

  i = write (fd, MESSAGE, sizeof(MESSAGE));
  if (i < 0) { perror ("write"); } else { printf ("write = %d\n", i); }
  /* read at most BUFSIZE - 1 bytes, leaving room for the null character */
  i = read (fd, buf, BUFSIZE - 1);
  if (i < 0) {
    perror ("read");
  } else {
    printf ("read = %d\n", i);
    buf [i] = '\0';
    printf ("%s", buf);
  }

  close (fd);
  return 0;
}

tty.c only has a single thread, and only writes and reads once. We really need a somewhat more complex program:

The following program, ttynet.c, shows an implementation of these requirements.

/* ttynet.c: provide serial-line send and receive */
/* link with -lpthread */
/* released under the GPL */

#include <stdio.h>
#include <stdlib.h>     /* malloc, free, exit */
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>
#include <pthread.h>

/* exported functions, could be in a .h file */
int install_tty_data_handler (int tty, void (*) (int, char));
int write_tty_data (int tty, char data);

/* this should definitely be in a .h file */
#define MAX_TTYS        100

/* keep a mapping from tty numbers to unix file descriptor numbers */
static int tty_fds [MAX_TTYS] = {0, };

/* any static function is NOT exported */
static int initialize_tty (int tty_number)
{
  /* assume no TTY number has more than 100 digits */
  char tty_name [sizeof("/dev/ttyS0") + 100];
  int i, fd;
  struct termios tio;

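  /* these are not system call failures, so errno is not meaningful: */
  /* report them with fprintf rather than perror */
  if (tty_number < 0 || tty_number >= MAX_TTYS) {
    fprintf (stderr, "tty number %d out of range\n", tty_number);
    exit(1);
  }
  if (tty_fds [tty_number] != 0) {
    fprintf (stderr, "tty %d already open\n", tty_number);
    exit(1);
  }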
  sprintf (tty_name, "/dev/ttyS%d", tty_number);
  fd = open(tty_name, O_RDWR);
  printf ("fd = %d\n", fd);
  if (fd <= 0) { perror ("open"); exit(1); }
  tty_fds [tty_number] = fd;

  i = tcgetattr (fd, &tio);
  if (i < 0) { perror ("tcgetattr"); exit(1); }
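  /* speeds are reported as termios codes such as B9600, not as baud rates */
  if (cfgetispeed (&tio) != B9600) {
    printf ("ispeed was code %lu, != 9600 baud (code %lu)\n",
            (unsigned long) cfgetispeed (&tio), (unsigned long) B9600);
    cfsetispeed (&tio, B9600);
  }
  if (cfgetospeed (&tio) != B9600) {
    printf ("ospeed was code %lu, != 9600 baud (code %lu)\n",
            (unsigned long) cfgetospeed (&tio), (unsigned long) B9600);
    cfsetospeed (&tio, B9600);
  }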
  cfmakeraw (&tio);
  i = tcsetattr (fd, TCSANOW, &tio);
  if (i < 0) { perror ("tcsetattr"); exit (1); }

  return tty_number;
}

struct receive_thread_arg {
  void (* data_handler) (int, char);
  int tty;
};

static void * tty_receive_thread (void * argument)
{
  /* cast the argument back to a pointer to the receive_thread_arg */
  struct receive_thread_arg * rta = (struct receive_thread_arg *) argument; 
  void (* data_handler) (int, char) = rta->data_handler; 
  int tty = rta->tty;

  printf ("tty_receive_thread is starting\n");
  /* we have read the argument, it won't be used ever again, so free it */
  free (argument);
  /* set the argument to NULL to guarantee it won't ever be used again */
  argument = NULL;

  /* loop forever, and whenever data is received, call the data handler */
  /* when no data is available, the loop blocks on read. */
  while (1) {
    char buffer [1];
    int i = read (tty_fds [tty], buffer, 1);

    if (i == -1) {
      perror ("read");
      exit (1);
    }
    if (i == 1) {
      /* deliver the data with an upcall */
      data_handler (tty, buffer [0]);
    } else {
      printf ("ttynet error: got value %d from 'read', expected 1\n", i);
    }
  }
  /* we never return, but if we ever did, we'd want to return a void *  */
  return NULL;
}

/* returns the identifier (an integer >= 0) to be used for write_tty_data */
int install_tty_data_handler (int tty, void (* data_handler) (int, char))
{
  pthread_t thread;
  int actual_tty = initialize_tty (tty);
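  struct receive_thread_arg * arg =
    (struct receive_thread_arg *) malloc (sizeof (struct receive_thread_arg));
  int err;

  if (arg == NULL) { perror ("malloc"); exit (1); }
  arg->tty = actual_tty;
  arg->data_handler = data_handler;
  /* pthread_create returns 0 on success and an error number on failure; */
  /* it does not set errno, so perror cannot be used here */
  err = pthread_create (&thread, NULL, &tty_receive_thread, (void *) arg);
  if (err != 0) {
    fprintf (stderr, "pthread_create: error %d\n", err);
    exit (1);
  }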
  return actual_tty;
}

int write_tty_data (int tty, char data)
{
  char buffer [1];

  buffer [0] = data;
  return write (tty_fds [tty], buffer, 1);
}

#ifdef RUN_THIS_TEST
/* this is a sample program to exercise the above code */

/* my data handler simply prints any received data to the screen */
static void my_test_data_handler (int tty, char c)
{
  putchar (c);
}

int main (void)
{
  int tty = install_tty_data_handler (0, my_test_data_handler); 
  char data_to_send [] = "this is my test data\n123\n";
  size_t i;

  /* sizeof counts the terminating null, which is sent along with the text */
  for (i = 0; i < sizeof (data_to_send); i++) {
    write_tty_data (tty, data_to_send [i]);
  }
  /* wait and see if we receive anything */
  printf ("sleeping 100 seconds\n");
  sleep (100);
  return 0;
}

#endif /* RUN_THIS_TEST */
A few things to note in this code:
  1. No buffer overflows -- we are careful never to have an assignment or data copy operation which might exceed the bounds of a buffer.
  2. Exhaustive testing of function return codes. If there is any error, we will know about it.
  3. Print statements help trace the flow of execution.
  4. No assumptions are made about the relative order of execution of the main thread and the receive thread.
  5. Very careful management of dynamic memory, carefully matching malloc and free.
  6. Memory allocated on the stack (e.g. the buffer in tty_receive_thread) is only used within the scope of the declaration -- we never return to the caller a pointer to this memory.
  7. Simple test code can test the library functions, and is ifdef'd out for the production version.
These are all principles which help produce bug-free C code. The one remaining stumbling block for many people is the use of string functions. Here are a few hints for dealing with those:
  1. To C, a character pointer is just a pointer. This can be expanded:
  2. In C, a string is an array of characters which contains a null character ('\0') to mark the end of the string.
  3. That means that whether a (char*) is or is not a valid string depends on (a) whether it is a valid pointer to (b) a valid memory area (c) containing a null character somewhere in the valid memory.
  4. Unless (a-c) are ALL true, string calls (e.g. strlen) are going to fail: sometimes with segmentation fault or bus error, sometimes returning strange results, and sometimes silently.
  5. Note that because of (c), binary data may sometimes look like a string. Calling strlen on such strings will produce strange results, because the null character might not be at the end of the data.
  6. If your program is reading binary data from the network, never use string functions on the data you have read. Use the mem functions (e.g. memcpy) on such data, and always keep track of the number of bytes using a separate integer variable.
  7. If you are reading ASCII (printable) data from the network, never assume that the data is terminated with a null character. If you're going to use a received buffer as a string, always add a null character at the end of the data you have received.
Following these principles should help in writing C code that is relatively easy to debug.
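As a small illustration of the last hint, the sketch below reads printable data and always terminates it before it is used as a string. The function name read_as_string is hypothetical. It uses read on a file descriptor, and the same pattern applies to recv on a socket.

```c
#include <sys/types.h>
#include <unistd.h>

/* Read at most size - 1 bytes from fd into buf and terminate them with a
   null character, so that buf can then be passed to string functions.
   Returns the number of bytes read (not counting the null), or -1 on error. */
static ssize_t read_as_string (int fd, char *buf, size_t size)
{
  ssize_t n;

  if (size == 0) return -1;         /* no room even for the null character */
  n = read (fd, buf, size - 1);     /* leave one byte for the null character */
  if (n < 0) return -1;
  buf [n] = '\0';
  return n;
}
```

The returned count is still the only reliable length. If the received data itself contains a null character, strlen on the buffer reports fewer bytes than were actually read.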

6. A Packet Network

Given that ttynet gives us the ability to transfer individual characters, we want to add the ability to send groups of characters, known variously as frames or packets. To do this, we can use ttynet. Before sending any data, we can send a two-byte integer giving the length of the packet, and then send the data. We would only have to decide whether to first send the most significant byte of the length (big-endian encoding) or the least significant byte of the length (little-endian encoding).
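A length-prefixed frame of this kind might be built as in the following sketch, which arbitrarily chooses the big-endian encoding. The function names frame_with_length and frame_length are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Build a frame: a two-byte length, most significant byte first,
   followed by the data. out must have room for len + 2 bytes.
   Returns the frame size, or 0 if len does not fit in 16 bits. */
static size_t frame_with_length (const unsigned char *data, size_t len,
                                 unsigned char *out)
{
  if (len > 0xffff) return 0;
  out [0] = (unsigned char) ((len >> 8) & 0xff);  /* most significant byte */
  out [1] = (unsigned char) (len & 0xff);         /* least significant byte */
  memcpy (out + 2, data, len);
  return len + 2;
}

/* Recover the length from the first two bytes of a frame. */
static size_t frame_length (const unsigned char *frame)
{
  return ((size_t) frame [0] << 8) | (size_t) frame [1];
}
```

For a 300-byte packet, the two length bytes are hex 01 and 2C, since 300 is hex 012C. A little-endian encoding would send 2C first.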

This scheme has one fatal flaw. On a serial line, characters are occasionally lost. Assuming that one character is lost, our receive routine would keep reading until the first byte of the next packet is received. This byte is the first byte of the length, but the receiver does not know that. So the receiver returns that packet to the application, and reads the next two bytes, expecting to find the length of the next packet. Unfortunately, these two bytes are, respectively: the second byte of the length, and the first byte of the data. Putting them together as a 16-bit integer produces nonsense, and the receiver and sender get out of synch. The two may accidentally resynchronize later, but the chances are small, and meanwhile, communication is lost.

The alternative is to use a special character to mark the beginning of a frame. Since all possible characters may be present in the data, we also need some escape mechanism to help us distinguish the special character marking the beginning of the frame from any and all occurrences of this special character in the data. Once such a scheme is available we can send packets on serial lines.

One very simple scheme for framing packets is described by the SLIP protocol, documented by RFC 1055, available at http://www.ietf.org/rfc/rfc1055.txt (all RFCs, that is, all Internet protocol definitions, are available from this site). SLIP defines not one, but two special characters, END and ESC. To quote the RFC,

The SLIP protocol defines two special characters: END and ESC. END is octal 300 (decimal 192) and ESC is octal 333 (decimal 219) not to be confused with the ASCII ESCape character; for the purposes of this discussion, ESC will indicate the SLIP ESC character. To send a packet, a SLIP host simply starts sending the data in the packet. If a data byte is the same code as END character, a two byte sequence of ESC and octal 334 (decimal 220) is sent instead. If it the same as an ESC character, an two byte sequence of ESC and octal 335 (decimal 221) is sent instead. When the last byte in the packet has been sent, an END character is then transmitted.
Phil Karn suggests a simple change to the algorithm, which is to begin as well as end packets with an END character.

This strategy is called byte stuffing -- replacing one byte in the data by multiple transmitted bytes.
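As a worked illustration of the rules quoted above, the sketch below stuffs a packet into an output buffer, including the leading END suggested by Phil Karn. It is written for these notes and is not the code from the RFC, which sends each character as it goes. The names slip_encode and the SLIP_ constants are hypothetical.

```c
#include <stddef.h>

#define SLIP_END        0300    /* marks the end of a packet */
#define SLIP_ESC        0333    /* introduces a two-byte escape sequence */
#define SLIP_ESC_END    0334    /* ESC ESC_END stands for an END data byte */
#define SLIP_ESC_ESC    0335    /* ESC ESC_ESC stands for an ESC data byte */

/* Encode len bytes of data as one SLIP frame in out.
   out must have room for 2 * len + 2 bytes, the worst case.
   Returns the number of bytes placed in out. */
static size_t slip_encode (const unsigned char *data, size_t len,
                           unsigned char *out)
{
  size_t i, n = 0;

  out [n++] = SLIP_END;         /* leading END flushes any line noise */
  for (i = 0; i < len; i++) {
    switch (data [i]) {
    case SLIP_END:              /* data byte equal to END: stuff two bytes */
      out [n++] = SLIP_ESC;
      out [n++] = SLIP_ESC_END;
      break;
    case SLIP_ESC:              /* data byte equal to ESC: stuff two bytes */
      out [n++] = SLIP_ESC;
      out [n++] = SLIP_ESC_ESC;
      break;
    default:                    /* any other byte is sent unchanged */
      out [n++] = data [i];
    }
  }
  out [n++] = SLIP_END;         /* trailing END marks the end of the packet */
  return n;
}
```

For example, the four data bytes 01 C0 DB 02 (in hex) are transmitted as the eight bytes C0 01 DB DC DB DD 02 C0.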

Note that the END character, octal 300, is hex C0, the ESC character, octal 333, is hex DB, and the two encoding characters, octal 334 and 335, are hex DC and DD. In general, you should be comfortable converting between hex and other formats. To do this manually we need to write down the binary representation of a number, which can then conveniently be converted to any other format.

The RFC itself is worth reading -- it is only six pages long, including a complete implementation of the byte stuffing algorithm which replaces a single ESC or END character with the appropriate two-character sequence.

The byte stuffing in SLIP is a special case of a more general principle. In general, we want a symbol to mark boundaries within a stream of symbols. If we have a symbol that we can transmit that is not a valid data item, then we can use that symbol to mark the boundary. That is what C does in using the null character (which is not a valid character in a C string) to mark the end of a string. In networking, however, we use binary encodings and we want to be able to transmit arbitrary binary data. As an extreme example, we could actually transmit 9 bits for each byte of data sent, and use the 9th bit to mark the beginning or end of a frame. Likewise we could transmit 8-bit bytes, but only send 7 bits of actual data in each transmitted byte, reserving the 8th bit for this signaling. These two strategies have a relatively large overhead of 1/9 (11%) or approximately 1/8 (12.5% -- the overhead may be larger if we are only sending 8 bits of data, since we would then have to transmit two bytes), and such an overhead is usually undesirable.

With byte stuffing, we only add overhead in relatively rare cases: at the beginning or end of the frame, and when the escape or end symbols are transmitted. If the data has a uniform random distribution, then we will only stuff one byte 2 times out of every 256 bytes sent, for an average overhead of approximately 1%, though in the worst case, the data will consist entirely of escape and end bytes, and the overhead will be 100%. Additional overhead is introduced for very small packets -- again, a packet of size 1 has 100% overhead if its data does not have to be escaped, and 200% overhead if it does. While these are large numbers, it would be unusual to transmit very many escape or end bytes, and if we are only transmitting small packets, we don't usually send too many of them. TCP, for example, consolidates many small transmissions into a single large segment, so that small packets are only sent when absolutely necessary.

Bit stuffing is similar to byte stuffing, but on the bit level. In one example, at least one fifth of the bits in any given transmission are required to be "0" to guarantee receiver synchronization (in this scheme, a "0" is encoded by a transition in the signal, and a "1" is encoded as no change in the signal). That means if we are sending lots of zeros, the receiver can synchronize on the data. However, if we are sending long sequences of "1" bits, the receiver may lose synchronization and have a significant chance of either inserting or deleting a bit.

In this case, the sender monitors the data as it is sent, and after any sequence of 4 data bits with a value of "1", it automatically inserts ("stuffs") a "0" bit. This means the receiver can rely on the fact that a sequence "11110" represents only four (not five) actual bits of data. To give concrete examples, the four bits "1110" are sent unchanged as "1110", the five bits "11110" are sent as "111100", and the five bits "11111" are sent as "111101".
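The sender's rule can be sketched as follows. For readability the bits are represented as a string of the characters '0' and '1', which is an assumption of this example only (real hardware works on the bits themselves), and the function name bit_stuff is hypothetical.

```c
#include <stddef.h>

/* Copy the bit string in to out, inserting a '0' after every run of
   four consecutive '1' data bits.
   out must have room for strlen (in) + strlen (in) / 4 + 1 characters.
   Returns the number of bits placed in out. */
static size_t bit_stuff (const char *in, char *out)
{
  size_t n = 0;
  int ones = 0;         /* consecutive '1' bits seen since the last '0' */

  for (; *in != '\0'; in++) {
    out [n++] = *in;
    if (*in == '1') {
      if (++ones == 4) {
        out [n++] = '0';    /* the stuffed bit */
        ones = 0;           /* the stuffed '0' starts a new run */
      }
    } else {
      ones = 0;
    }
  }
  out [n] = '\0';
  return n;
}
```

This reproduces the examples in the text: "1110" is unchanged, "11110" becomes "111100", and "11111" becomes "111101". The receiver reverses the process by discarding the bit that follows any four consecutive "1" bits.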

In this kind of bit stuffing, the worst-case overhead is 25%, but in most cases, the overhead is about 1/16 of 25%, or 1/64 -- 1.6%.

Implementing the Packet Network

We can use the same strategy to implement SLIP as we did for ttynet, meaning we will have an upcall for receiving data and a write function to send data. Adapting the code from RFC 1055, here is our packet network.
/* slipnet.c: provide serial-line send and receive of packets */
/* link with ttynet and pthreads */
/* released under the GPL */

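#include <stdio.h>
#include <string.h>
#include <unistd.h>     /* sleep, used by the test program */
#include <pthread.h>
/* the prototypes for install_tty_data_handler and write_tty_data
   must also be visible here; they are provided by ttynet */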

#define MAX_SLIP_SIZE   1006
#define MAX_SLIP_SEND   1006
#define END             0300    /* indicates end of packet */
#define ESC             0333    /* indicates byte stuffing */
#define ESC_END         0334    /* ESC ESC_END means END data byte */
#define ESC_ESC         0335    /* ESC ESC_ESC means ESC data byte */

#define MAX_TTYS        100

/* exported functions, could be in a .h file */
int install_slip_data_handler (int, void (*) (int, char *, int));
int write_slip_data (int, char *, int);

/* buffers for the data */
static char receive_buffer [MAX_TTYS] [MAX_SLIP_SIZE];
/* this is the position to which we add newly received characters */
static int receive_position [MAX_TTYS];
/* record whether the last character for this buffer was an escape character */
static int escaped [MAX_TTYS];
/* true if an error was detected in the current frame */
static int error_frame [MAX_TTYS];
/* serialize all access to the buffers */
static pthread_mutex_t receive_mutex [MAX_TTYS];
static pthread_mutex_t send_mutex [MAX_TTYS];
/* serialize access to the global data */
static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER;
/* the data handlers are also global. */
typedef void (* my_data_handler) (int, char *, int);
static my_data_handler slip_data_handler [MAX_TTYS];

static void print_packet (char * string, char * data, int numbytes)
{
  int i;

  printf ("%s:\n", string);
  for (i = 0; i < numbytes; i++) {
    /* must mask the byte with 0xff, since otherwise bytes greater
       than 0x80 will be converted to negative integers */
    printf ("%02x", (data [i]) & 0xff);
    if ((i == (numbytes - 1)) || (i % 20 == 19)) {
      printf ("\n");
    } else {
      printf (".");
    }
  }
}

static void put_char_in_buffer (int tty, unsigned char c)
{
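  /* the buffer holds MAX_SLIP_SIZE bytes, at indices 0 .. MAX_SLIP_SIZE - 1,
     so a maximum-size packet of MAX_SLIP_SEND bytes must fit */
  if (receive_position [tty] < MAX_SLIP_SIZE) {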
    receive_buffer [tty] [(receive_position [tty])++] = c;
  } else {
    printf ("error: slip framing error on port %d, maybe lost END\n", tty);
    /* discard the character -- basically, we don't save it anywhere. */
    /* also make sure the current frame is discarded */
    error_frame [tty] = 1;
  }
}

static void data_handler_for_tty (int tty, unsigned char c)
{
#ifdef DEBUG
  printf ("  received character %x/%o on port %d\n", c, c, tty);
#endif /* DEBUG */
  /* make sure we have been initialized */
  pthread_mutex_lock (&global_mutex);
  /* we have been initialized, so proceed */
  pthread_mutex_unlock (&global_mutex);
  /* acquire the lock for the receive buffer */
  pthread_mutex_lock (&(receive_mutex [tty]));
  if (error_frame [tty]) {
    if (c == END) {
      error_frame [tty] = 0;
      receive_position [tty] = 0;
      escaped [tty] = 0;
    }
  } else {
    if (escaped [tty]) {        /* last character was an escape */
      escaped [tty] = 0;
      if (c == ESC_END) {
        put_char_in_buffer (tty, END);
      } else if (c == ESC_ESC) {
        put_char_in_buffer (tty, ESC);
      } else {   /* this may be a legitimate oversight in the sender */
        printf ("warning: accepting illegal character after ESC\n");
        put_char_in_buffer (tty, c);
      }
    } else {                    /* last character was not ESC */
      if (c == END) {           /* done, give packet to data handler. */
        if (receive_position [tty] > 0) { /* packet is not empty */
          if (slip_data_handler [tty] == NULL) {
            printf ("error: received packet, but no slip data handler\n");
            print_packet ("received packet", receive_buffer [tty],
                          receive_position [tty]);
            receive_position [tty] = 0;
          } else {
#ifdef DEBUG
            printf ("received %d bytes\n", receive_position [tty]);
            print_packet ("received packet", receive_buffer [tty],
                          receive_position [tty]);
#endif /* DEBUG */
            /* note the receive buffer remains locked while we call the
               slip data handler.  If the slip data handler never returns,
               slip will deadlock, i.e., be unable to ever again receive data.
               This would also block the receive thread in ttynet. */
            slip_data_handler [tty] (tty, receive_buffer [tty],
                                     receive_position [tty]);
          }
          /* get ready to start receiving a new packet */
          receive_position [tty] = 0;
        } /* else: silently ignore packets of size 0 */
      } else if (c == ESC) {        /* signal for the next character */
        escaped [tty] = 1;
      } else {                        /* 'normal' character */
        put_char_in_buffer (tty, c);
      }
    }
  }
  /* finally make the buffer available to other threads. */
  pthread_mutex_unlock (&(receive_mutex [tty]));
}

/* returns the identifier (an integer >= 0) to be used for write_slip_data */
int install_slip_data_handler (int tty,
                               void (* data_handler) (int, char *, int))
{
  int fd;

  /* keep thread from executing until we are done initializing */
  pthread_mutex_lock (&(global_mutex));
  fd = install_tty_data_handler (tty, data_handler_for_tty);
  if (fd < 0) {
    pthread_mutex_unlock (&global_mutex);
    return fd;
  }
  receive_position [fd] = 0;
  escaped [fd] = 0;
  error_frame [fd] = 0;
  /* initialize the mutexes in place: copying a pthread_mutex_t with
     memcpy is not portable */
  pthread_mutex_init (&(receive_mutex [fd]), NULL);
  pthread_mutex_init (&(send_mutex [fd]), NULL);
  slip_data_handler [fd] = data_handler;
  pthread_mutex_unlock (&global_mutex);
  return fd;
}

/* this is a macro so the return statement returns from write_slip_data.
   the do/while (0) wrapper makes the macro behave as a single statement */
#define WRITE_BYTE(fd, c)                               \
  do {                                                  \
    if (write_tty_data ((fd), (c)) != 1) {              \
      pthread_mutex_unlock (&(send_mutex [(fd)]));      \
      printf ("slip: error writing tty data\n");        \
      return -1;                                        \
    }                                                   \
  } while (0)

int write_slip_data (int fd, char * data, int numbytes)
{
  int byte;

  if ((numbytes <= 0) || (numbytes > MAX_SLIP_SEND)) {
    printf ("slip: bad size %d\n", numbytes);
    return -1;
  }
#ifdef DEBUG
  printf ("acquiring send lock for tty %d\n", fd);
#endif /* DEBUG */
  pthread_mutex_lock (&(send_mutex [fd]));
#ifdef DEBUG
  print_packet ("sending packet", data, numbytes);
#endif /* DEBUG */
  WRITE_BYTE (fd, END);
  for (byte = 0; byte < numbytes; byte++) {
    unsigned char c = (data [byte]) & 0xff;

    if (c == END) {
      WRITE_BYTE (fd, ESC);
      WRITE_BYTE (fd, ESC_END);
    } else if (c == ESC) {
      WRITE_BYTE (fd, ESC);
      WRITE_BYTE (fd, ESC_ESC);
    } else {                        /* normal byte */
      WRITE_BYTE (fd, c);
    }
  }
  WRITE_BYTE (fd, END);
  pthread_mutex_unlock (&(send_mutex [fd]));
  return numbytes;
}

#ifdef RUN_SLIP_TEST
/* this is a sample program to exercise the above code */

/* my data handler simply prints any received data to the screen */
static void my_test_data_handler (int tty, char * data, int numbytes)
{
  printf ("tty %d, ", tty);
  print_packet ("slip received packet", data, numbytes);
}

int main (void)
{
  int slip0 = install_slip_data_handler (0, my_test_data_handler); 
  int slip1 = install_slip_data_handler (1, my_test_data_handler); 
  int slip2 = install_slip_data_handler (2, my_test_data_handler); 
  char data1 [] = "123\300\333\334\335xxx\300\334\335 321";
  char data2 [] = "\300";
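  char data3 [] = "\333\334\335";

  if (slip0 < 0) {   /* without the first tty there is nothing to test */
    printf ("unable to install slip data handler on tty 0\n");
    return 1;
  }
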
  write_slip_data (slip0, data1, sizeof (data1) - 1);
  if (slip1 >= 0) write_slip_data (slip1, data1, sizeof (data1) - 1);
  if (slip2 >= 0) write_slip_data (slip2, data1, sizeof (data1) - 1);
  sleep (10);
  write_slip_data (slip0, data2, sizeof (data2) - 1);
  if (slip1 >= 0) write_slip_data (slip1, data2, sizeof (data2) - 1);
  if (slip2 >= 0) write_slip_data (slip2, data2, sizeof (data2) - 1);
  sleep (10);
  write_slip_data (slip0, data3, sizeof (data3) - 1);
  if (slip1 >= 0) write_slip_data (slip1, data3, sizeof (data3) - 1);
  if (slip2 >= 0) write_slip_data (slip2, data3, sizeof (data3) - 1);
  /* wait and see if we receive anything */
  printf ("sleeping 100 seconds\n");
  sleep (100);
}
#endif /* RUN_SLIP_TEST */
A few things of note:
  1. There is no need to create a new thread in the data handler. We have a choice. In the code above, the receive thread from ttynet is used to call the higher-level data handler. The alternative would be to create a new thread to call the higher-level data handler, freeing the tty-level thread to go back and read more characters from the serial port.
  2. We have a single, statically created array of buffers for the data. If we had an additional thread for the slip data handler, we would have to create an additional buffer for each packet, and the slip data handler would then be responsible for returning the buffer. Returning the buffer can be done in a number of ways. The simplest is to call free(), but this gives the network protocol stack no control over how much space is used. One alternative is for the slip data handler to return the buffer directly to slip. Slip can then reuse it for a future packet. If slip receives data but all the buffers are in use, slip can discard the data. Most operating systems do something along these lines. They must limit the memory they use, and so on occasion they must discard incoming data.
  3. The code we ported from the RFC was actually substantially modified, but the core functionality is the same.
  4. The test code exercises all the special cases we care about, including each of the special characters and packets of length 1 and length greater than 1.
  5. To debug this protocol, we really need to know what data is received, and what we do with the data. Turning on debugging will do that, though if the protocol had problems, more debugging statements might be needed.
  6. This code handles up to MAX_TTYS ttys. To handle n ttys, we need n buffers and n copies of each variable, including the mutexes and the data handler, which is why each of these is declared as an array indexed by tty.

Threads and buffer management are part of what makes "real" networking and operating system code harder to specify and code than "toy" implementations.
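The buffer-return alternative from note 2 can be sketched as a small fixed-size pool. This is only an illustration under stated assumptions, not code from slipnet: the names pool_get_buffer and pool_return_buffer and the pool size POOL_BUFFERS are hypothetical. When every buffer is in use, pool_get_buffer returns NULL, and the caller would then discard the incoming packet.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define POOL_BUFFERS   8      /* hypothetical limit on buffers in use */
#define POOL_BUF_SIZE  1006   /* one maximum-size SLIP packet */

static char pool_data [POOL_BUFFERS] [POOL_BUF_SIZE];
static int pool_in_use [POOL_BUFFERS];
/* the receive thread and the handler threads both use the pool */
static pthread_mutex_t pool_mutex = PTHREAD_MUTEX_INITIALIZER;

/* returns a free buffer, or NULL if all buffers are in use */
char * pool_get_buffer (void)
{
  char * result = NULL;
  int i;

  pthread_mutex_lock (&pool_mutex);
  for (i = 0; i < POOL_BUFFERS; i++) {
    if (! pool_in_use [i]) {
      pool_in_use [i] = 1;
      result = pool_data [i];
      break;
    }
  }
  pthread_mutex_unlock (&pool_mutex);
  return result;
}

/* called by the data handler when it is done with a packet.
   returns 0 on success, -1 if buffer is not a buffer currently in use */
int pool_return_buffer (char * buffer)
{
  int result = -1;
  int i;

  assert (buffer != NULL);
  pthread_mutex_lock (&pool_mutex);
  for (i = 0; i < POOL_BUFFERS; i++) {
    if ((buffer == pool_data [i]) && pool_in_use [i]) {
      pool_in_use [i] = 0;
      result = 0;
      break;
    }
  }
  pthread_mutex_unlock (&pool_mutex);
  return result;
}
```

The design choice here is that the pool, not the handler, decides how much memory the protocol stack may use: the ninth simultaneous request simply fails, which is the point at which an operating system would drop the packet.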

7. Summary: our network so far

The network we have described so far can do the following:
  1. allow packet exchange between two hosts
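  2. maximum packet size is 1006 bytes (this is the maximum size we send; implementations that follow a different standard might send up to 1024 bytes, but the receive buffer in the code above holds only MAX_SLIP_SIZE = 1006 bytes, so MAX_SLIP_SIZE would have to be raised to accept such packets)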
  3. in this network, packets that are delivered are always delivered in order
  4. in this network, packets that are delivered are always delivered within a specific maximum time, about 1ms/byte average, about 3ms/byte worst case.
Some of the things we might like this network to do, but it doesn't:
  1. allow us to connect more than two hosts
  2. allow for exchanges of more than 1006 bytes
  3. automatically detect and retransmit packets that have been discarded due to errors
  4. automatically detect packets that have been received, but that are missing one or more bytes, or that have had one or more bytes corrupted
  5. send data faster, or a longer distance, than the slipnet can handle
  6. make sure the bytes are actually received in order. For example, the second thread requesting the receive_mutex might obtain it first, and the two bytes would be reordered.
Much of what follows is designed to add these features to this simple network.

Naming in the Internet

Given a simple packet network such as the slip network we have just described, we want to turn it into a more complex network with more than two hosts. This brings us to the idea of naming hosts. With only two hosts, there is no need to explicitly identify the host to which I send data or from which I receive data -- it is simply the host I am directly connected to. With multiple hosts, I need to specify who data is for and I may need to know who is sending me data. The identification of each host is a name or an address.

Identifications don't, strictly speaking, need to identify hosts. The Internet Protocol (IP), for example, uses addresses to identify network interfaces. In our example network, multiple serial ports on a single machine would each be connected to a different system, and each would have its own distinct IP address. Even though in everyday usage we talk of "a host's IP address", this is only correct when that host only has a single network interface. A host with multiple interfaces is called a multi-homed host -- it has a "home" on multiple separate networks.

IP has two different versions. The older version is IPv4, the newer version is IPv6. Addresses in both versions have the following properties:

We contrast IP addresses with domain names in the domain name system, DNS. (Domain names are often called DNS names or DNS addresses). A domain name is a human-readable variable-length string identifying a host, for example, mail.ics.hawaii.edu or www.cs.cmu.edu. Properties of domain names include:

"Identifying a host" is a loose term. More accurately, there is a mapping from domain names to IP addresses. Not all possible domain names have such a mapping, nor do all assigned domain names have such a mapping (for example, www.hawaii.edu might have such a mapping, but hawaii.edu might not). Several domain names may map to a single IP address. A single domain name may map to multiple IP addresses. Some IP addresses may not correspond to any domain names at all.

This mapping is maintained by a distributed database, the domain name system. Each individual or organization that has ownership of a domain name is responsible for maintaining the portion of the database corresponding to that domain name and all domain names that are below it in the hierarchy. This means Information Technology Services (ITS) at the University of Hawaii is responsible for maintaining the portion of the database for all names ending in hawaii.edu. They may delegate some of this responsibility to others. For example, the responsibility for maintaining the portion of the database for all names ending in ics.hawaii.edu has been delegated to the system and network administrators of the Information and Computer Sciences department.

In DNS, each contiguous part of the hierarchy which is under the control of one individual or organization is called a zone.

The main purpose of this distributed database is to translate ("resolve") domain names to IP addresses and vice versa. The details are available in RFC 1034 and RFC 1035. Documentation that is both more accessible and more thorough is at http://www.dns.net/dnsrd/.

Domain names themselves consist of labels separated by periods. A label consists of one or more letters, digits (not at the beginning of a label), and hyphens (neither at the beginning nor at the end of a label). The maximum length of a single label is 63 characters, and the maximum length of a domain name is 255 characters.
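The syntax rules above translate directly into a checking function. The sketch below follows the rules exactly as stated in these notes, so it rejects labels that begin with a digit, even though later standards (RFC 1123) relaxed that restriction. The function names are hypothetical, and a trailing period is treated as an error to keep the example short.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_LABEL_LENGTH  63
#define MAX_NAME_LENGTH   255

/* a label is 1 to 63 letters, digits, and hyphens, beginning with a
   letter and not ending with a hyphen */
static int is_valid_label (const char * label, size_t len)
{
  size_t i;

  if ((len < 1) || (len > MAX_LABEL_LENGTH)) {
    return 0;
  }
  if (! isalpha ((unsigned char) label [0])) {
    return 0;
  }
  if (label [len - 1] == '-') {
    return 0;
  }
  for (i = 0; i < len; i++) {
    unsigned char c = (unsigned char) label [i];
    if ((! isalnum (c)) && (c != '-')) {
      return 0;
    }
  }
  return 1;
}

/* returns 1 if name is a sequence of valid labels separated by periods */
int is_valid_domain_name (const char * name)
{
  const char * start = name;
  size_t total;

  assert (name != NULL);
  total = strlen (name);
  if ((total < 1) || (total > MAX_NAME_LENGTH)) {
    return 0;
  }
  for (;;) {
    const char * dot = strchr (start, '.');
    size_t len = (dot != NULL) ? (size_t) (dot - start) : strlen (start);

    if (! is_valid_label (start, len)) {
      return 0;
    }
    if (dot == NULL) {
      return 1;   /* that was the last label */
    }
    start = dot + 1;
  }
}
```

Both example names from the text, mail.ics.hawaii.edu and www.cs.cmu.edu, are accepted, while names such as "-bad.edu" or "a..b" are rejected.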

The process of domain name resolution requires a resolver (someone who wants to resolve a domain name) to query a domain name server. This DNS server may or may not be the same as the authoritative server for the domain that the resolver is in. The resolver must be configured with the DNS server's IP address (not its domain name, since that would cause a bootstrapping problem). To quote RFC 1035,

The resolver starts with knowledge of at least one name server. When the resolver processes a user query it asks a known name server for the information; in return, the resolver either receives the desired information or a referral to another name server. Using these referrals, resolvers learn the identities and contents of other name servers. Resolvers are responsible for dealing with the distribution of the domain space and dealing with the effects of name server failure by consulting redundant databases in other servers.

A domain name server may have the translation, either because the server is authoritative for the zone that the DNS request is seeking, or because the server has cached the result of a previous request. Because the data changes infrequently -- much less frequently, and more predictably, than web pages, for example -- caching can be very effective. Translations received from a server carry an indication of the length of time they may be cached.

If a domain name server does not have the translation, it must be configured with the IP address of a server that is closer (in the hierarchy) to the specified domain name. This means that each domain name server for a zone Z must be configured with the IP addresses of all the servers for all the zones below Z, and the IP address of at least one server for the zone above Z. If such a server receives a query for a domain name for which it is not authoritative and which it has not cached, the selection of the next server to query is automatic depending on whether the name is reached by traversing the domain name tree downwards towards the leaves or upwards towards the root.

The Domain Name System database is designed to distribute resource records (RRs). The most common RR type is "A", which provides the IP address listed for a given domain name. Another RR type is "CNAME", which provides the canonical domain name listed for a given "alias" domain name.

A DNS query or response always begins with a header in a fixed format. It starts with a 16-bit ID, followed by one bit to distinguish queries from responses, four bits to specify the type of query, a bit to specify whether this response is authoritative, a bit to record that the data had to be truncated, a bit each to specify whether recursion is desired or available, a 3-bit reserved field, and a 4-bit response code, followed by four 16-bit integers. These integers record the number of entries in the question section, the number of RRs in the answers section, the number of RRs in the name server section, and the number of RRs in the additional records section. Again quoting from RFC 1035,

  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      ID                       |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    QDCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ANCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    NSCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ARCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

This header is followed by the specified number of questions, answers, name server records, and additional records.
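As an illustration of the header layout, here is a minimal sketch of a function that unpacks the 12 header bytes. The struct and function names are assumptions made for this example. Bit 0 in the RFC diagram is the most significant bit, and all 16-bit fields are sent most significant byte first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DNS_HEADER_SIZE 12

struct dns_header {
  uint16_t id;
  int qr;        /* 0 for a query, 1 for a response */
  int opcode;    /* 4 bits, type of query */
  int aa;        /* authoritative answer */
  int tc;        /* truncated */
  int rd;        /* recursion desired */
  int ra;        /* recursion available */
  int z;         /* 3 reserved bits */
  int rcode;     /* 4 bits, response code */
  uint16_t qdcount, ancount, nscount, arcount;
};

/* read a 16-bit integer, most significant byte first */
static uint16_t get16 (const unsigned char * p)
{
  return (uint16_t) ((p [0] << 8) | p [1]);
}

/* returns 0 on success, -1 if the message is too short */
int parse_dns_header (const unsigned char * msg, size_t len,
                      struct dns_header * h)
{
  uint16_t flags;

  assert ((msg != NULL) && (h != NULL));
  if (len < DNS_HEADER_SIZE) {
    return -1;
  }
  h->id = get16 (msg);
  flags = get16 (msg + 2);
  h->qr     = (flags >> 15) & 0x1;
  h->opcode = (flags >> 11) & 0xf;
  h->aa     = (flags >> 10) & 0x1;
  h->tc     = (flags >> 9) & 0x1;
  h->rd     = (flags >> 8) & 0x1;
  h->ra     = (flags >> 7) & 0x1;
  h->z      = (flags >> 4) & 0x7;
  h->rcode  = flags & 0xf;
  h->qdcount = get16 (msg + 4);
  h->ancount = get16 (msg + 6);
  h->nscount = get16 (msg + 8);
  h->arcount = get16 (msg + 10);
  return 0;
}
```

For example, a second 16-bit word of 0x8180 decodes as a response (QR = 1) with both RD and RA set and a response code of 0.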

A question is always encoded with the domain name, followed by the question type (e.g. "A" for an address translation), followed by the question class (usually Internet).

A domain name is encoded by the concatenation of its labels. Each label is encoded as a one-byte length field followed by that number of characters. The last label must have length 0. For a specific example, the domain name "abab.bbb" would be encoded as follows, where "a" is ASCII 61 (hex) and "b" is ASCII 62 (hex):

0x04 0x61 0x62 0x61 0x62 0x03 0x62 0x62 0x62 0x00

The length of the first label, 4 characters, is encoded first, followed by the four characters of the label. The length of the next label follows, then the characters of that label. Finally, a label of length zero marks the end of the domain name. Note that no periods (".") are used!
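The encoding can be sketched as follows. The function name encode_domain_name is hypothetical, and the sketch does no compression and does not check which characters appear in each label.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* encodes name (labels separated by periods) into out as a sequence of
   length-prefixed labels ending with a label of length 0.
   returns the number of bytes written, or -1 on error */
int encode_domain_name (const char * name, unsigned char * out,
                        size_t outsize)
{
  const char * start = name;
  size_t pos = 0;

  assert ((name != NULL) && (out != NULL));
  for (;;) {
    const char * dot = strchr (start, '.');
    size_t len = (dot != NULL) ? (size_t) (dot - start) : strlen (start);

    if ((len < 1) || (len > 63)) {
      return -1;                   /* empty or overlong label */
    }
    /* need room for the length byte, the label, and the final 0 */
    if (pos + 1 + len + 1 > outsize) {
      return -1;
    }
    out [pos++] = (unsigned char) len;
    memcpy (out + pos, start, len);
    pos += len;
    if (dot == NULL) {
      break;                       /* that was the last label */
    }
    start = dot + 1;
  }
  out [pos++] = 0;                 /* label of length 0 ends the name */
  if (pos > 255) {
    return -1;                     /* name too long */
  }
  return (int) pos;
}
```

Encoding "abab.bbb" with this function writes the ten bytes shown in the example above.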

A label length must be 63 or less, meaning the first two bits of the length byte must be zero. To compress resource records, a sender may specify, instead of an 8-bit label length followed by characters, a 16-bit number beginning with two "1" bits. The remaining 14 bits specify an offset, in bytes, from the beginning of the DNS packet header; the rest of the domain name is read starting at that offset. For example, if the domain name "hawaii.edu" is encoded starting at position 29 (hex 1D) from the beginning of the DNS packet, the name "www.hawaii.edu" in a subsequent (or earlier!) resource record can be specified as:

0x03 0x77 0x77 0x77 0xC0 0x1D

Where the ASCII for "w" is hex 77.
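A receiver has to follow these pointers when it decodes a name. The following sketch shows one way to do so. The function name and the MAX_POINTER_HOPS limit are assumptions for this example; the limit is there so that a malformed packet whose pointers form a loop cannot keep the decoder running forever.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_POINTER_HOPS 16   /* hypothetical limit, guards against loops */

/* decodes the domain name that starts at msg [offset] into out as a
   string with periods between labels.  msg is the entire DNS packet,
   since pointers are offsets from its beginning.
   returns 0 on success, -1 on a malformed name or lack of space */
int decode_domain_name (const unsigned char * msg, size_t msglen,
                        size_t offset, char * out, size_t outsize)
{
  size_t pos = offset;
  size_t o = 0;
  int hops = 0;

  assert ((msg != NULL) && (out != NULL));
  for (;;) {
    unsigned int len;

    if (pos >= msglen) {
      return -1;
    }
    len = msg [pos];
    if (len == 0) {
      break;                          /* end of the name */
    }
    if ((len & 0xC0) == 0xC0) {       /* two "1" bits: a pointer */
      if ((pos + 1 >= msglen) || (++hops > MAX_POINTER_HOPS)) {
        return -1;
      }
      pos = ((len & 0x3F) << 8) | msg [pos + 1];
      continue;
    }
    if ((len > 63) || (pos + 1 + len > msglen)) {
      return -1;
    }
    if (o + len + 2 > outsize) {      /* room for '.' and final '\0' */
      return -1;
    }
    if (o > 0) {
      out [o++] = '.';
    }
    memcpy (out + o, msg + pos + 1, len);
    o += len;
    pos += 1 + len;
  }
  if (o >= outsize) {
    return -1;
  }
  out [o] = '\0';
  return 0;
}
```

Given a packet with "hawaii.edu" encoded at offset 0x1D, decoding the six bytes from the example above yields "www.hawaii.edu".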

All the resource records that are not questions have the following format (from RFC 1035):

  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                                               |
/                                               /
/                      NAME                     /
|                                               |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      TYPE                     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                     CLASS                     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      TTL                      |
|                                               |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                   RDLENGTH                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
/                     RDATA                     /
/                                               /
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

The name, type, and class match the corresponding question. The TTL, time to live, is a 32-bit number of seconds that this answer may be cached. RDLENGTH is the length in bytes of the RDATA field, which contains the desired answer.

The format of the RDATA field depends on the type and class of the response. For "A" queries, the RDATA field is 4 bytes wide and contains the IP address of the translation (if a translation fails, that is indicated in the response code). For "CNAME" queries, the RDATA field contains the canonical DNS name.
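To make the record layout concrete, the sketch below parses the fixed-size fields that follow the NAME and extracts the address from an "A" record. The struct and function names are assumptions for this example, and the caller is assumed to have already skipped past the encoded NAME. Type 1 is "A" and class 1 is Internet in RFC 1035.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DNS_TYPE_A    1   /* address record, from RFC 1035 */
#define DNS_CLASS_IN  1   /* Internet class, from RFC 1035 */

struct dns_rr_fixed {
  uint16_t type;
  uint16_t rr_class;
  uint32_t ttl;                 /* seconds this answer may be cached */
  uint16_t rdlength;            /* number of bytes in rdata */
  const unsigned char * rdata;  /* points into the original packet */
};

/* p points just past the NAME of a resource record, len is the number
   of bytes available from p.  returns 0 on success, -1 if too short */
int parse_rr_fixed (const unsigned char * p, size_t len,
                    struct dns_rr_fixed * rr)
{
  assert ((p != NULL) && (rr != NULL));
  if (len < 10) {               /* TYPE, CLASS, TTL, RDLENGTH */
    return -1;
  }
  rr->type     = (uint16_t) ((p [0] << 8) | p [1]);
  rr->rr_class = (uint16_t) ((p [2] << 8) | p [3]);
  rr->ttl      = ((uint32_t) p [4] << 24) | ((uint32_t) p [5] << 16) |
                 ((uint32_t) p [6] << 8) | (uint32_t) p [7];
  rr->rdlength = (uint16_t) ((p [8] << 8) | p [9]);
  if ((size_t) rr->rdlength > len - 10) {
    return -1;                  /* RDATA runs past the end of the data */
  }
  rr->rdata = p + 10;
  return 0;
}

/* copies the 4-byte IPv4 address of an "A" record into addr.
   returns 0 on success, -1 if rr is not an Internet "A" record */
int get_a_address (const struct dns_rr_fixed * rr, unsigned char addr [4])
{
  assert ((rr != NULL) && (addr != NULL));
  if ((rr->type != DNS_TYPE_A) || (rr->rr_class != DNS_CLASS_IN) ||
      (rr->rdlength != 4)) {
    return -1;
  }
  memcpy (addr, rr->rdata, 4);
  return 0;
}
```

The four bytes in addr are in the order in which the address is normally written, so printing them with "%u.%u.%u.%u" gives the familiar dotted form.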

DNS queries and replies can be carried over either TCP or UDP, in either case using port 53 on the server. It is more common for queries and responses to be carried over UDP, and for zone transfers to use TCP. The format is the same in both cases, except that

What next?

Although our simple network allows us to transfer packets between two connected hosts, and we have different ways (IP addresses, and DNS names which can be translated to IP addresses) of identifying different hosts, we still can only transfer data to and from hosts that we are directly connected to, that is, hosts to which we have a physical serial cable connection. This is extremely limiting. The next chapter describes how to make multi-homed hosts forward packets so that a packet is delivered to the interface corresponding to a given IP address. Transferring packets across multiple physical networks is called internetworking or internetting, hence the name for IP -- the Internet Protocol.