This is an individual project. You may discuss concepts and ideas with other students, but you should be the sole author of all your code. All your code must be written in C.
Submission is electronic (see below). Submission must be on time -- late submissions will not be graded and will receive no credit. So please submit what you have on the due date, by 11:59pm HST on September 28th.
The assignment is to write a simple HTTP/1.0 web server. The server is unusual in that it does not serve the file specified in the URL. Instead, it accesses the specified file, searches through it for references, randomly selects one of the references, and forwards the contents of that reference to the client.
Forwarding the contents is similar to what a web proxy might do.
Selecting one of the links at random is similar to what the quiz server does.
Note: RFC 1945 is a long document. Part of this assignment is to read and understand the document, and identify which parts are relevant to this project and useful to you.
I want you to write a forwarding web server (called webselect). The web server loads the specified html file (only files ending in .html or .htm should be loaded in this first step) and parses it.
The path /home/1/esb/public_html/ (or, for your own testing, /home/n/yourlogin -- but remember to change it back before submitting) should be placed before whatever path is specified in the URL itself, so that all html files specified to your server are relative to /home/1/esb/public_html/.
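As an illustration of the prefixing rule above, here is a minimal sketch of one way to build the local file name. The function name build_local_path and the DOC_ROOT macro are hypothetical. Rejecting ".." and matching the extension case-sensitively are assumptions of this sketch, not requirements stated in the assignment.

```c
#include <stdio.h>
#include <string.h>

/* Document root named in the assignment; change it for your own testing. */
#define DOC_ROOT "/home/1/esb/public_html/"

/*
 * Build the local file name for a URL path such as "/dir/page.html".
 * Returns 0 on success, or -1 if the name does not end in .html or .htm,
 * contains "..", or does not fit in out.
 */
int build_local_path(const char *url_path, char *out, size_t outlen)
{
    size_t len;
    int n;

    while (*url_path == '/')            /* DOC_ROOT already ends in '/' */
        url_path++;

    /* Assumption: conservatively refuse any name containing "..". */
    if (strstr(url_path, "..") != NULL)
        return -1;

    /* Only .html and .htm files are loaded in this first step. */
    len = strlen(url_path);
    if (!((len >= 5 && strcmp(url_path + len - 5, ".html") == 0) ||
          (len >= 4 && strcmp(url_path + len - 4, ".htm") == 0)))
        return -1;

    n = snprintf(out, outlen, "%s%s", DOC_ROOT, url_path);
    if (n < 0 || (size_t)n >= outlen)   /* error or truncated */
        return -1;
    return 0;
}
```

With this sketch, a request for /a/index.html maps to /home/1/esb/public_html/a/index.html, and a request for /x.txt is refused.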
The purpose of parsing is to extract all the HTML references in the page. References are identified by the string "href=", which may have arbitrary capitalization. webselect must then return to the client the contents of one randomly selected reference.
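The search for "href=" can be done by hand without lex or yacc. The following is a rough sketch only: the names match_nocase and next_href are hypothetical, and the rule for where a reference ends (a closing quote, or whitespace or '>' when unquoted) is an assumption of this sketch. The buffer is treated as length-delimited, not as a C string.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return nonzero if the first n bytes at p equal pat, ignoring case. */
static int match_nocase(const char *p, const char *pat, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (tolower((unsigned char)p[i]) != tolower((unsigned char)pat[i]))
            return 0;
    return 1;
}

/*
 * Find the next "href=" (any capitalization) in buf[0..len) at or after *pos.
 * On success, store the start and length of the reference in *ref and
 * *reflen, advance *pos past it, and return 1. Return 0 if there are no more.
 */
int next_href(const char *buf, size_t len, size_t *pos,
              const char **ref, size_t *reflen)
{
    size_t i = *pos;

    while (i + 5 <= len) {
        if (match_nocase(buf + i, "href=", 5)) {
            size_t start = i + 5, end;
            char quote = 0;

            /* Assumption: the value may be wrapped in " or ' quotes. */
            if (start < len && (buf[start] == '"' || buf[start] == '\'')) {
                quote = buf[start];
                start++;
            }
            end = start;
            while (end < len) {
                char c = buf[end];
                if (quote ? c == quote
                          : (c == '>' || isspace((unsigned char)c)))
                    break;
                end++;
            }
            *ref = buf + start;
            *reflen = end - start;
            *pos = end;
            return 1;
        }
        i++;
    }
    *pos = len;
    return 0;
}
```

Calling next_href repeatedly with the same pos variable, starting from 0, visits every reference in order, so the caller can count them on a first pass and then pick one.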
The references themselves may point to pages that are not HTML pages, for example images or plain text. For this project, you must be able to process references to files with the following extension and MIME type:
The references may be to files that are on the local machine or on other servers. If the files are on other servers, webselect must act as a proxy, retrieving the file from the other server and sending the contents to the client. The header sent by the other server may be forwarded, unchanged, directly to the client.
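Before webselect can act as a proxy it needs the host, port, and path from the reference. A minimal sketch follows. The name split_http_url is hypothetical, and the sketch assumes a lowercase "http://" prefix and a default port of 80 when none is given.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Split "http://host[:port][/path]" into its parts.
 * host receives a NUL-terminated copy of the host name, *port the port
 * (80 if absent), and *path points at the path inside url ("/" if absent).
 * Returns 0 on success, -1 if the URL is not in the expected form.
 */
int split_http_url(const char *url, char *host, size_t hostlen,
                   int *port, const char **path)
{
    const char *p, *end, *colon;
    size_t n;

    if (strncmp(url, "http://", 7) != 0)   /* assumption: lowercase scheme */
        return -1;
    p = url + 7;
    end = p + strcspn(p, "/");             /* end of the host[:port] part */
    colon = memchr(p, ':', (size_t)(end - p));

    n = (size_t)((colon ? colon : end) - p);
    if (n == 0 || n >= hostlen)
        return -1;
    memcpy(host, p, n);
    host[n] = '\0';

    *port = 80;
    if (colon) {
        char *stop;
        long v = strtol(colon + 1, &stop, 10);
        if (stop == colon + 1 || stop != end || v < 1 || v > 65535)
            return -1;
        *port = (int)v;
    }
    *path = (*end == '/') ? end : "/";
    return 0;
}
```

For example, "http://www.ics.hawaii.edu:99/path" yields host www.ics.hawaii.edu, port 99, and path /path.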
If the references are to files that are on the local machine, webselect must act as a server, returning a header to the client, reading the file, and sending it back. The file reference might be absolute, if it begins with a "/" character, or relative to the directory in which the original html file was found.
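The absolute versus relative rule for local references can be sketched as follows. The name resolve_local_ref is hypothetical. This sketch copies an absolute reference unchanged; whether an absolute reference should also be placed under the document root is not decided here and is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Turn a local reference into a file name.
 * html_path is the file name of the HTML page the reference came from.
 * A reference starting with '/' is copied as-is; anything else is taken
 * relative to the directory that contains html_path.
 * Returns 0 on success, -1 if the result does not fit in out.
 */
int resolve_local_ref(const char *html_path, const char *ref,
                      char *out, size_t outlen)
{
    int n;

    if (ref[0] == '/') {
        n = snprintf(out, outlen, "%s", ref);
    } else {
        /* Keep everything up to and including the last '/' of html_path. */
        const char *slash = strrchr(html_path, '/');
        int dirlen = slash ? (int)(slash - html_path) + 1 : 0;
        n = snprintf(out, outlen, "%.*s%s", dirlen, html_path, ref);
    }
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

For a page at /home/1/esb/public_html/dir/page.html, the reference img.gif resolves to /home/1/esb/public_html/dir/img.gif.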
In summary, webselect should return the contents of one of the references, randomly chosen.
Calling Sequence: webselect must take one or two arguments. The first argument is the port number and is required, and webselect should terminate if it is not present. The second argument is optional. If it is present, it is an integer n giving the number of the selection to return. When this argument is present, webselect must, the first time it is called, return the nth reference (n can take any value from 1, 2, 3, ...). If n is out of range (i.e. there is no nth reference), webselect must terminate the first time a client connects to it with a valid file. After the first connection, webselect must return to using the random selection.
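The selection rule above (nth reference on the first connection, random afterwards) might be sketched as below. The name choose_reference and its parameters are hypothetical, and using rand() with the remainder operator is just one simple choice. Returning -1 when the page has no references at all is an assumption of this sketch; the caller decides what to do in that case.

```c
#include <stdlib.h>

/*
 * Pick which reference to return.
 *   count         number of references found in the page
 *   have_n        nonzero if the optional selection argument was given
 *   n             the selection argument (the first reference is number 1)
 *   first_request nonzero only for the first client connection
 * Returns a zero-based index, or -1 if no valid choice exists (on the
 * first request with an out-of-range n the server must then terminate).
 */
int choose_reference(int count, int have_n, long n, int first_request)
{
    if (count <= 0)
        return -1;
    if (have_n && first_request) {
        if (n < 1 || n > count)
            return -1;
        return (int)(n - 1);
    }
    /* The slight modulo bias of rand() is ignored in this sketch. */
    return rand() % count;
}
```

If you use rand(), remember to seed it once at startup, for example with srand((unsigned)time(NULL)), or every run will make the same choices.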
Synopsis of the calling sequence:
usage: webselect port [selection]
webselect will terminate if the arguments are not integers, if the port is already in use, or if the selection (if specified) fails to identify a valid reference in the URL specified in the first access to the server. The first reference is number 1.
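Checking that the arguments are integers is easiest with strtol, which reports where parsing stopped. A minimal sketch, with hypothetical names parse_int and parse_args; the range check on the selection is deliberately left for the first request, as described above.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse s as a decimal integer in [lo, hi]; return 0 on success, else -1. */
static int parse_int(const char *s, long lo, long hi, long *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 10);
    /* Reject overflow, empty input, trailing junk, and out-of-range values. */
    if (errno != 0 || end == s || *end != '\0' || v < lo || v > hi)
        return -1;
    *out = v;
    return 0;
}

/*
 * Check "webselect port [selection]" as received by main.
 * Stores the port in *port; sets *have_n and *n if a selection was given.
 * Returns 0 if the arguments are acceptable, -1 if webselect should
 * print the usage line and terminate.
 */
int parse_args(int argc, char **argv, long *port, int *have_n, long *n)
{
    if (argc != 2 && argc != 3)
        return -1;
    if (parse_int(argv[1], 1, 65535, port) != 0)
        return -1;
    *have_n = 0;
    if (argc == 3) {
        if (parse_int(argv[2], LONG_MIN, LONG_MAX, n) != 0)
            return -1;
        *have_n = 1;
    }
    return 0;
}
```

On failure, main would typically print "usage: webselect port [selection]" to stderr and exit with a nonzero status.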
If you want to use telnet to connect to a server on the same machine, you can simply type:
telnet localhost portnumber
Port numbers below 1024 are generally reserved for system servers (the root user on Unix), so use higher numbers.
After testing using telnet, you should test using a regular web browser (or two, or 10). Most machines have a specific web browser, for example Netscape or IE, and uhunix2 also has the lynx text-only browser. Be sure your server works with either Netscape or IE, and with lynx. You can specify port numbers in URLs by putting a colon after the machine name followed by the port:
http://www.ics.hawaii.edu:99/path does an HTTP request on port 99 of www.ics.hawaii.edu, requesting /path.
You should also test using the TA's test program, which should be available at least a week before the deadline. Plan for this debugging phase to take at least a week.
When testing a network program, there is always the question of knowing what exactly my program is sending, and what exactly it is receiving from the peer. I strongly suggest that you add to your program a compile-time option (disable it before turning in the program) that allows you to see, and maybe save in a file, the entire exchange between your server and the client.
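One possible shape for such a compile-time option is sketched below. The macro names WEBSELECT_TRACE and TRACE and the function dump_bytes are hypothetical. Compiling with -DWEBSELECT_TRACE turns tracing on; without it the TRACE calls compile to nothing.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write a printable copy of buf[0..n) to fp on one line.
 * CR and LF are shown as \r and \n, a backslash as \\, and any other
 * non-printable byte as \xHH, so the exact exchange is visible.
 */
void dump_bytes(FILE *fp, const char *label, const char *buf, size_t n)
{
    size_t i;

    fprintf(fp, "%s %lu bytes: ", label, (unsigned long)n);
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)buf[i];

        if (c == '\r')
            fputs("\\r", fp);
        else if (c == '\n')
            fputs("\\n", fp);
        else if (c == '\\')
            fputs("\\\\", fp);
        else if (isprint(c))
            fputc(c, fp);
        else
            fprintf(fp, "\\x%02x", (unsigned)c);
    }
    fputc('\n', fp);
}

#ifdef WEBSELECT_TRACE
#define TRACE(label, buf, n) dump_bytes(stderr, (label), (buf), (n))
#else
#define TRACE(label, buf, n) ((void)0)
#endif
```

A call such as TRACE("recv", buf, (size_t)nread) after each read, and a matching one before each write, records the whole conversation; redirect stderr to a file to save it.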
Data read from the network is not in C string format (even though HTTP uses ASCII encoding), and specifically does not include the terminating NUL character ('\0').
When sending a C string, you can use strlen to decide how many bytes to write (because you are sending ASCII -- if you were sending binary, you would be unable to use strlen). When reading from the network, specify the buffer length and use the return value from read to determine how many bytes were read.
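A sketch of the sending side, under the assumption that the connection is an ordinary file descriptor. The names write_all and send_string are hypothetical. The loop is there because write may accept fewer bytes than requested.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Write exactly len bytes from buf to fd; return 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0) {
            if (errno == EINTR)     /* interrupted by a signal: try again */
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send a C string; strlen is valid here only because the data is ASCII text. */
int send_string(int fd, const char *s)
{
    return write_all(fd, s, strlen(s));
}
```

For binary data such as an image, call write_all directly with the byte count you got from read, never strlen.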
Remember that your program may or may not get all the data in a single read. The test program will specifically test to make sure your program handles the case where not all the data is received at once. You may have to do multiple reads to get the entire request header, and you may have to do multiple reads to get the entire contents when you are acting as a proxy. Be sure you have received the entire request header before processing the request (however, when acting as a proxy you may start to forward data as soon as you receive it -- just make sure you don't stop until you are done).
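For the proxy case, forwarding "as soon as you receive it" can be a simple copy loop that stops only when the other server closes the connection, which is how an HTTP/1.0 server normally marks the end of its reply. A minimal sketch; the name relay_until_eof is hypothetical.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Copy everything readable from descriptor from to descriptor to until
 * end of file. Returns the number of bytes forwarded, or -1 on error.
 */
long relay_until_eof(int from, int to)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t off = 0;
        ssize_t n = read(from, buf, sizeof buf);

        if (n == 0)                 /* peer closed: we are done */
            return total;
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        /* One read may need several writes before it is fully sent. */
        while (off < n) {
            ssize_t w = write(to, buf + off, (size_t)(n - off));

            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

Because the loop never inspects the data, it forwards the other server's header and body unchanged, whether they arrive in one read or in many.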
HTTP uses CRLF (\r\n) as a line terminator, but not all browsers and servers implement that, and your program should work correctly whether lines are terminated with CR, LF, or CRLF. Remember that you should be generous in what you accept, and strict in what you send.
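One way to combine the last two points is to keep reading until the buffer contains an empty line, accepting CR, LF, or CRLF as the terminator. The sketch below is only one possible approach, and the names eol_len, header_length, and read_header are hypothetical. It rescans the whole buffer after every read, which is simple and fast enough for short request headers. It treats a CR at the very end of the buffer as a complete terminator, so if the matching LF arrives later the caller should ignore it.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Length of the line terminator starting at buf[i], or 0 if there is none. */
static size_t eol_len(const char *buf, size_t len, size_t i)
{
    if (i >= len)
        return 0;
    if (buf[i] == '\n')
        return 1;
    if (buf[i] == '\r')
        return (i + 1 < len && buf[i + 1] == '\n') ? 2 : 1;
    return 0;
}

/*
 * Return the number of bytes in the header, including the empty line
 * that ends it, or 0 if buf[0..len) does not yet hold a complete header.
 */
size_t header_length(const char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        size_t first = eol_len(buf, len, i);
        size_t second;

        if (first == 0) {
            i++;
            continue;
        }
        /* Two terminators in a row mean an empty line. */
        second = eol_len(buf, len, i + first);
        if (second != 0)
            return i + first + second;
        i += first;
    }
    return 0;
}

/*
 * Read from fd until the header is complete. *got receives the total
 * number of bytes now in buf. Returns the header length, 0 if the peer
 * closed or the buffer filled before the header ended, -1 on read error.
 */
long read_header(int fd, char *buf, size_t cap, size_t *got)
{
    size_t used = 0;

    *got = 0;
    while (used < cap) {
        size_t hlen;
        ssize_t n = read(fd, buf + used, cap - used);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return 0;
        used += (size_t)n;
        *got = used;
        hlen = header_length(buf, used);
        if (hlen != 0)
            return (long)hlen;
    }
    return 0;
}
```

Note that the buffer is never treated as a C string: every function takes an explicit length, in line with the earlier remark that network data carries no terminating NUL.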
Part of this project is to parse data received from the network. Parsing in C is often done with "lex" and "yacc". Both of these are available on uhunix2. I suggest that you study them to decide whether they might be the best way of implementing your parser. There is no requirement that you use lex or yacc -- you may do your own parsing if you prefer.
The server I gave in homework 1 in a past course may give you ideas for webselect (I also encourage you to study that homework for more ideas about client-server programming). Feel free to use both that server and the code in the textbook as a basis for this project. Do not use any other code, unless you have written it yourself.
If you haven't done so already, you may want to learn to use the gdb debugger, or any other debugger available on your system. If you use gdb, you may want to check the GDB Manual.
You are welcome to use this case-independent string matching function.