Computer Networks Project 3


The goals of this project are:

  1. to learn about HTTP and application-level protocols
  2. to write a simple (and moderately useful) web robot
  3. to practice using the sockets interface

This is an individual or group project, at your choice (I am also open to discussing a different project that is of equivalent difficulty -- talk to me if you have something in mind). If you form a group, please mail the instructor with the group membership, and include all group members' names and emails in the status file, but you only need to turn in one copy of the project.

You may discuss ideas with other groups or individuals, both on and off the mailing list, but your group must be the sole author of all your code. Your code may be in any programming language available on uhunix2 or projects -- check with the instructor if you are not sure whether a language would be appropriate, though certainly C, C++, and Java are all appropriate.

Your implementation must interoperate with at least two different web servers -- for example, the server at www.ics.hawaii.edu (which runs Apache) and one of the servers www.botany.hawaii.edu or www.marbec.hawaii.edu (which run IIS). Two different sites running the same web server do not count. In your status file, please report two web sites that you have successfully tested against.

The project is due Wednesday, May 4th 2005, at any time. This is the last day of classes for this course, and no extensions will be given. Submission is electronic, following the same rules as for project 1 (except groups should cc all group members in the email). No credit will be given for late submissions.

Project Specification

The assignment is to implement a web client, checkweb, that can make requests and receive replies using HTTP/1.1, as described by RFC 2616. checkweb is given a single command-line argument, a URL that we will call the root URL. checkweb must parse the root URL (i.e., verify that it has the format http://host:port/path or http://host/path), then load the given path from the given host using the GET method -- the result is the root page. If the root page cannot be loaded, checkweb should print an appropriate message (one that reflects the status code received, if possible, but is easy to understand) and immediately exit. The remainder of this description assumes that the root page was loaded successfully.
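As a rough illustration of the parsing step, here is a minimal sketch in Python. The function name parse_root_url is hypothetical and not part of the assignment; the default port of 80 and the treatment of a missing path as "/" are assumptions of the sketch.

```python
from urllib.parse import urlsplit

def parse_root_url(url):
    """Return (host, port, path) for an http URL, or raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme.lower() != "http" or not parts.hostname:
        raise ValueError("expected http://host[:port]/path, got: " + url)
    # Assumption: a URL without an explicit port uses the standard HTTP port.
    port = parts.port if parts.port is not None else 80
    # Assumption: "http://host" with no path is treated as the path "/".
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.hostname, port, path

print(parse_root_url("http://www.ics.hawaii.edu:8080/a/b.html"))
```

A URL with any other scheme, or with no host, raises ValueError, which the caller can turn into a readable message before exiting.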

If the root page has Content-type text/html, checkweb must next search the root web page for HTML tags starting with src=" or href=" and ending with the nearest closing double quote. Three examples of such tags are:

All HTML tags are case-insensitive, that is, the first example above could use SrC (or any other combination of upper and lower case) in place of src. Also, you do not have to deal with URLs of the form ftp:, mailto:, or the like -- only http: URLs or the local tags shown above.
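One possible way to do this search is a case-insensitive regular expression, sketched below in Python. The names LINK_RE and extract_links are hypothetical, and the sample HTML is made up for illustration. Skipping every value that carries a scheme other than http is this sketch's reading of the rule about ftp:, mailto:, and the like.

```python
import re

# src=" or href=" in any mix of upper and lower case, followed by everything
# up to the nearest closing double quote.
LINK_RE = re.compile(r'\b(?:src|href)="([^"]*)"', re.IGNORECASE)

def extract_links(html):
    """Return the quoted values of src= and href= in order of appearance."""
    links = []
    for value in LINK_RE.findall(html):
        scheme = re.match(r'([A-Za-z][A-Za-z0-9+.-]*):', value)
        if scheme and scheme.group(1).lower() != "http":
            continue  # ftp:, mailto:, and similar URLs are out of scope
        links.append(value)
    return links

sample = '<IMG SrC="pic.gif"><a HREF="http://www.ics.hawaii.edu/x.html">x</a>'
print(extract_links(sample))
```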

When encountering such a tag, checkweb must do the following:

This is a recursive definition -- the recursion ends when the root page and all the pages linked from the root page that are hierarchically below it have been searched. At this point, checkweb has completed its task and should exit.

The URL specified in the command line is the root URL. If the root URL leads to a redirect, and the redirect specifies a URL that is not hierarchically below the root URL, then only a HEAD operation needs to be performed on the redirected URL.
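The choice between GET and HEAD can be sketched as follows. This is only one reading of "hierarchically below": the sketch assumes a URL is below the root when it has the same host and port and its path starts with the directory part of the root path. The names is_below_root and method_for are hypothetical.

```python
from urllib.parse import urlsplit

def is_below_root(url, root_url):
    """Assumed definition: same host and port, path inside the root's directory."""
    u, r = urlsplit(url), urlsplit(root_url)
    if (u.hostname, u.port or 80) != (r.hostname, r.port or 80):
        return False
    # Directory part of the root path: everything up to the last "/".
    root_dir = r.path[: r.path.rfind("/") + 1] or "/"
    return (u.path or "/").startswith(root_dir)

def method_for(url, root_url):
    """GET pages below the root so they can be searched; HEAD everything else."""
    return "GET" if is_below_root(url, root_url) else "HEAD"

root = "http://www.ics.hawaii.edu/dir/index.html"
print(method_for("http://www.ics.hawaii.edu/dir/sub/a.html", root))
print(method_for("http://www.botany.hawaii.edu/dir/a.html", root))
```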

HTTP/1.1

RFC 2616 is long and complex, so this section is a brief introduction that may help you study the document.

RFC 2616 has a lot more detail than you need. For example, at least 8 methods are defined, but you are only required to implement two, GET and HEAD. The first skill I encourage you to keep developing, as you already have with the other RFCs, is the skill to identify the parts of the document that are relevant to this project, and focus on those, only skimming the remainder to make sure you have not missed anything.

HTTP is a client-server (request-reply) protocol. The client (in this case, your program) sends a request and waits for the reply. The reply may come all at once, or many different read operations may be needed to collect the entire reply.
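The need for repeated reads can be illustrated with a short Python sketch. It assumes the server closes the connection when the reply is complete, so that an empty read marks the end of the data; the name read_until_closed and the chunk size are hypothetical choices.

```python
import socket

def read_until_closed(sock, chunk_size=4096):
    """Collect the whole reply, however many recv() calls that takes."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)
        if not data:  # an empty result means the peer closed the connection
            break
        chunks.append(data)
    return b"".join(chunks)
```

Because the pieces are accumulated in a list, the sketch places no fixed limit on the size of the reply.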

For HTTP/0.9 and HTTP/1.0, the server would close the connection to indicate the end of the data. For HTTP/1.1, the same behavior can be obtained if the client specifies Connection: close in the request header. If this header field is missing, the server is free to assume the client will send more requests using the same connection. Note that most servers will close any connection that is open and unused for an extended period of time.
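A request header that asks for this behavior might be assembled as in the Python sketch below. The function name build_request is hypothetical. The Host field is included because RFC 2616 requires it in every HTTP/1.1 request.

```python
def build_request(method, host, path, port=80):
    """Return the bytes of a minimal HTTP/1.1 request header."""
    # The port is only spelled out in Host when it is not the default.
    host_header = host if port == 80 else "%s:%d" % (host, port)
    lines = [
        "%s %s HTTP/1.1" % (method, path),
        "Host: " + host_header,
        "Connection: close",  # ask the server to close after the reply
        "",                   # the two empty strings produce the final
        "",                   # CRLF and the blank line ending the header
    ]
    return "\r\n".join(lines).encode("ascii")

print(build_request("GET", "www.ics.hawaii.edu", "/"))
```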

An HTTP header is a series of human-readable ASCII lines, each ending with CRLF (represented as "\r\n" in C). However, following the principle of being generous in what they accept, most servers will treat the header as correct even if its lines are terminated only by LF. The end of the HTTP header is marked by a blank line, i.e., a line termination immediately following another line termination.

The first line of an HTTP header is different for the request and the reply, and examples are shown in Sections 5 and 6 of the RFC.
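For the reply, the first line has the form "HTTP/1.1 200 OK": a version, a three-digit status code, and a reason phrase. A minimal Python sketch of reading it follows. The names parse_status_line and describe are hypothetical, and the wording of the messages is only a suggestion.

```python
def parse_status_line(line):
    """Split a reply's first line into (version, status code, reason phrase)."""
    parts = line.strip().split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/") or not parts[1].isdigit():
        raise ValueError("not an HTTP status line: %r" % line)
    reason = parts[2] if len(parts) == 3 else ""
    return parts[0], int(parts[1]), reason

def describe(code):
    """Map the class of a status code to a short, readable message."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirected to another location"
    if 400 <= code < 500:
        return "the request was rejected (client error)"
    if 500 <= code < 600:
        return "the server failed to handle the request"
    return "unrecognized status code"

print(parse_status_line("HTTP/1.1 404 Not Found\r\n"))
```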

The second and subsequent lines of an HTTP header each have the form field: value, where field might be Content-length, Content-type, Accept, Etag, and so on. Field names are case-insensitive. The value depends on the field, and is again encoded using ASCII. The value begins after the colon-blank sequence (the blank may be missing), and ends at the end of the line.
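The header layout described above can be sketched in Python as follows. The names split_reply and parse_fields are hypothetical. The sketch accepts LF as well as CRLF line endings, and it ignores repeated fields and continuation lines, which a complete client may need to handle.

```python
import re

def split_reply(raw):
    """Split raw reply bytes into (list of header lines, body bytes)."""
    # The header ends at the first blank line; accept CRLF or bare LF.
    m = re.search(rb"\r?\n\r?\n", raw)
    if m is None:
        raise ValueError("no blank line found: the header is incomplete")
    # ISO-8859-1 decodes any byte, so a stray non-ASCII byte cannot fail here.
    head = raw[: m.start()].decode("iso-8859-1")
    return re.split(r"\r?\n", head), raw[m.end():]

def parse_fields(header_lines):
    """Return {lower-cased field name: value} for the second and later lines."""
    fields = {}
    for line in header_lines[1:]:  # line 0 is the request or status line
        name, sep, value = line.partition(":")
        if sep:  # strip() copes with a missing blank after the colon
            fields[name.strip().lower()] = value.strip()
    return fields

raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-length:5\r\n\r\nhello"
lines, body = split_reply(raw)
print(parse_fields(lines), body)
```

Lower-casing the field names lets the rest of the program look up "content-type" without worrying about how a particular server capitalizes it.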

Because HTTP headers are ASCII, you can use telnet to connect to the http port of a server (you can refer back to homework 1 if you need details for how to do this), enter your request header by hand, and examine the response. This is a useful aid in debugging. You can also print headers that you send and receive, and cut and paste them into a telnet session to test specific items.

RFC 2616 says HTTP/1.1 systems should be able to handle HTTP/1.0 queries and responses. Your system need not ever handle HTTP/1.0 -- any behavior (including none at all) is acceptable when receiving an HTTP/1.0 response.

Some Details

It is relatively easy to build a simple program to get a single URL, and you should definitely start with this.

After you can get a given URL, you can start to modify your code to check the content type and parse any HTML that is received.

The next two steps are to do the full recursive search, and to correctly detect errors and print out meaningful messages.

The last step is to implement redirection correctly. http://www.ics.hawaii.edu is a redirected page, and you can use it for your test. You must be a little careful with redirection because of its potential to (a) reach a link that is not hierarchically below the root page, in which case you must use HEAD, and (b) reach a link that has been loaded before.
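The bookkeeping for redirection might look like the Python sketch below. The names redirect_target and should_fetch, the example host, and the particular set of status codes treated as redirects are all assumptions of the sketch. It expects the header fields as a dictionary keyed by lower-cased field name.

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307)  # assumption: the codes handled here

def redirect_target(current_url, status_code, fields):
    """Return the absolute URL a reply redirects to, or None."""
    if status_code not in REDIRECT_CODES:
        return None
    location = fields.get("location")
    if not location:
        return None
    # urljoin leaves absolute URLs alone and resolves relative ones.
    return urljoin(current_url, location)

def should_fetch(url, visited):
    """Return True the first time a URL is seen, False on every repeat."""
    if url in visited:
        return False
    visited.add(url)
    return True

visited = set()
target = redirect_target("http://host.example/a/b.html", 301, {"location": "c/"})
print(target, should_fetch(target, visited), should_fetch(target, visited))
```

Recording every URL before it is requested guards against case (b) and also against two pages that redirect to each other. Whether the target is then loaded with GET or HEAD depends on whether it is hierarchically below the root, as in case (a).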

Your program must handle arbitrary-sized web pages and an arbitrary number of web pages. Initially, if you wish, you can restrict page sizes to 20,000 bytes and web pages to a maximum of 100, but for full credit you should not have any such arbitrary limits.

Implementing the above will give you full credit.

I encourage you to start early and, if possible, finish early -- this project may be simpler than the first two projects. If you do finish early, please feel free to turn in the project early -- I will accept a resubmission if you later find a bug that makes a significant difference.

If you want to extend your project once you have turned it in, you can check for much more than "dead links". For example, you could spell-check text files (assuming they are in English, which some aren't). Or, you can collect statistics including access time, number of files, number of lines, number of links per page. Or, you can simply run it on all of your web sites. Or, you can allow different search criteria. There are many web site checkers available, though personally I am only familiar with weblint. Needless to say, any code you turn in MUST be your own, and not taken or adapted from any other website checker (you are welcome to adapt any code from homework 1).

I could also imagine modifying this web client (after you have turned it in) to run on top of your (or someone else's) project 1 and project 2. In order to test it you would then need to write a simple web server -- which should not be terribly hard once you have written the web client. If you have time and interest this summer, it might be a fun project, though not terribly useful (because the TCP and IP that you wrote only interoperate with TCP and IP in the simulated Internet, though with SLIP and serial lines you could definitely send real TCP/IP packets across the network).

I suggest that you only test your web program on sites that are not likely to contain proprietary material, nor to interpret your searches as signs of impending attack. Sites hosted within .hawaii.edu may be particularly appropriate.

I do make mistakes. If you find mistakes in this project, please send mail to me or to the mailing list.