Computer Networks Project 3


The goals of this project are:

  1. to learn about HTTP and application-level protocols
  2. to write a simple (and moderately useful) web robot
  3. to further practice using the sockets interface

This is an individual or group project, at your choice (I am also very open to discussing a different project that is equally challenging -- talk to me if you have something in mind, or for example if you want to work on AllNet). If you form a group, please mail the instructor with the group membership, and include all group members' names and emails in the status file, but you only need to turn in one copy of the project.

You may discuss ideas with other groups or individuals, both on and off the mailing list, but your group must be the sole author of all your code. Your code may be in any reasonable programming language -- check with the instructor if you are not sure whether a language would be appropriate, though certainly C, C++, Java, Perl, Python, and Ruby are all appropriate.

Your implementation must interoperate with at least two different web servers -- for example, the server at www.ics.hawaii.edu (which runs Apache), and the server at www.uhwo.hawaii.edu (which runs IIS). Two different sites running the same web server do not count. In your status file, please report two web sites that you have successfully tested against, and the corresponding server.

The project is due Wednesday, December 5th, 2016, at any time. This is the last day of classes, and no extensions will be given. Accordingly, please begin working immediately. Submission is electronic, following the same rules as for project 1 (each group should only submit one copy of the project, and must cc all group members in the email). No credit will be given for late submissions.

Project Specification

The assignment is to implement a simple search web client, findweb, that can make requests and receive replies using HTTP/1.1, described in RFC 2616 (alternately, you may refer to RFCs 7230, 7231, 7232, 7233, 7234, and 7235, which collectively update RFC 2616).

At your option, you may implement HTTP/2. However, be aware that HTTP/2 is significantly more complex than HTTP/1.1.

findweb is given at least two command-line arguments:

  1. a search string
  2. one or more root URLs

The arguments may also include -r, calling for a recursive search.

findweb must parse each root URL (i.e. verify it has the format http://host:port/path or http://host/path), then load the given path from the given host using the GET method -- the result is a root page. If a root page cannot be loaded, findweb should print an appropriate message (reflecting, if possible, the status code received, but with a message that is easy to understand) and immediately exit or proceed to the next root URL. The remainder of this description assumes that this root page was loaded successfully.
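
For illustration, here is one way the URL check might look in Python (a sketch only: the function name and the use of urllib.parse are my choices, not requirements, and you may parse the URL by hand in any language):

    # Sketch of root-URL parsing: accept http://host:port/path or
    # http://host/path, default the port to 80 and the path to "/".
    from urllib.parse import urlsplit

    def parse_root_url(url):
        parts = urlsplit(url)
        if parts.scheme != "http" or not parts.hostname:
            return None          # not a URL this program accepts
        port = parts.port if parts.port is not None else 80
        path = parts.path if parts.path else "/"
        return parts.hostname, port, path

    # parse_root_url("http://www2.hawaii.edu/~esb")
    #   -> ("www2.hawaii.edu", 80, "/~esb")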

If the root page has any Content-type other than text/html, findweb should exit (or move on to the next URL) after reporting that the page is not text/html -- this is not an error.

If the root page has Content-type text/html, findweb must next search the root web page for the search string, and print the number of times the search string is found in the page and the URL.
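
For example, a minimal sketch in Python (the exact output format is my suggestion -- the specification only asks for the count and the URL):

    # Count non-overlapping occurrences of the search string in the page
    # and report them together with the page's URL.
    def report_matches(url, page_text, search_string):
        count = page_text.count(search_string)
        print(count, url)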

If the search is recursive, findweb should also search for HTML tags starting with src=" or href=" and ending with the nearest closing double quote. Three examples of such tags are:

  src="http://host/path/image.png"
  href="http://host/path/page.html"
  href="page.html"

All HTML tags are case-insensitive, that is, the first example above could use SrC (or any other combination of upper and lower case) as well as src. Also, you do not have to deal with URLs of the form https:, ftp:, mailto:, or the like -- only http tags or the local tags shown above.
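
A sketch of the tag extraction in Python (the regular expression is one possible reading of the rule above, with re.IGNORECASE providing the case-insensitivity):

    import re

    # Match src="..." or href="...", in any mixture of upper and lower
    # case, and capture everything up to the nearest closing double quote.
    LINK_RE = re.compile(r'(?:src|href)="([^"]*)"', re.IGNORECASE)

    def extract_links(html):
        return LINK_RE.findall(html)

    # extract_links('<IMG SrC="pic.png"> <a href="http://host/page.html">')
    #   -> ['pic.png', 'http://host/page.html']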

When encountering such a tag, findweb must do the following:

  1. extract the URL between the double quotes (turning a local URL into a full URL relative to the page on which it was found),
  2. check that this URL is hierarchically below the root URL and has not already been requested, and
  3. if both checks pass, load that page with GET and search it in the same way as the root page.

This is a recursive definition -- the recursion ends when the root page and all the pages linked from the root page that are hierarchically below it have been searched. To do this, findweb must keep track of which pages it has already searched, or alternately, of which web pages it has requested.
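
One simple way to keep this bookkeeping, sketched in Python (the names are illustrative):

    # Remember every URL that has been requested; request each at most once.
    requested = set()

    def should_request(url):
        if url in requested:
            return False
        requested.add(url)
        return True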

For every page loaded, i.e. both for root pages and, if -r is used, for recursively accessible pages, findweb must print the number of matches and the URL.

Once all these pages have been requested and searched, findweb has completed its task and should exit.

The URLs specified in the command line are the root URLs. If a root URL leads to a redirect, and the redirect specifies a URL that is not hierarchically below the root URL, then the search on that URL is complete. A redirect that is hierarchically below can be found at http://www2.hawaii.edu/~esb -- the redirect leads to http://www2.hawaii.edu/~esb/, that is, with a '/' at the end. A redirect that is not hierarchically below can be found at http://www.uhwo.hawaii.edu/.
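
One reasonable interpretation of "hierarchically below", sketched in Python, is a simple prefix test (this is an assumption about the rule on my part -- check your own interpretation against the two examples above):

    # A URL is treated as hierarchically below the root URL if the root
    # URL is a prefix of it.
    def is_below(root_url, url):
        return url.startswith(root_url)

    # is_below("http://www2.hawaii.edu/~esb", "http://www2.hawaii.edu/~esb/")
    #   -> True
    # is_below("http://www2.hawaii.edu/~esb", "http://www.uhwo.hawaii.edu/")
    #   -> False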

HTTP/1.1

RFC 2616 is long and complex, so this section is a brief introduction that may help you study the document.

RFC 2616 has a lot more detail than you need. For example, at least 8 methods are defined, but you are only required to implement one, GET. The first skill I encourage you to keep developing, as you already have with the other RFCs, is the skill to identify the parts of the document that are relevant to this project, and focus on those, only skimming the remainder to make sure you have not missed anything.

HTTP is a client-server (request-reply) protocol. The client (in this case, your program) sends a request and waits for the reply. The reply may come all at once, or many different read operations may be needed to collect the entire reply.

For HTTP/0.9 and HTTP/1.0, the server would close the connection to indicate the end of the data. For HTTP/1.1, the same behavior can be obtained if the client specifies Connection: close in the request header. If this specification is missing, the server is free to assume the client will send more requests using the same connection. Note that most servers will close any connection that is open and unused for an extended period of time.
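
For example, a request header asking for this behavior might be built as in the following Python sketch (the Host field is required by HTTP/1.1):

    # Build a GET request that asks the server to close the connection
    # once the reply has been sent.  Each line ends with CRLF, and the
    # header ends with an empty line.
    def build_request(host, path):
        return ("GET {} HTTP/1.1\r\n"
                "Host: {}\r\n"
                "Connection: close\r\n"
                "\r\n").format(path, host).encode("ascii")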

An HTTP header is a series of human-readable ASCII lines, each ending with CRLF (represented as "\r\n" in C). However, being generous in what they accept, most servers will treat the header as valid even if its lines are terminated only by LF. This means you can use nc to connect to the server port and type in a request header with some chance of getting a reply (none of this is true for HTTP/2).

The end of the HTTP header is marked by a blank line, i.e., a line termination immediately following another line termination. This is true whether the standard is used (line termination being "\r\n"), or a non-standard line ending is used (line termination being "\n", as happens when you connect with nc).
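
For example, the header can be collected with a loop like the following Python sketch (accepting either line ending, as described above):

    # Read from the socket until the blank line that ends the header.
    # Anything read past the blank line is the beginning of the body.
    def read_header(sock):
        data = b""
        while b"\r\n\r\n" not in data and b"\n\n" not in data:
            chunk = sock.recv(4096)
            if not chunk:            # connection closed early
                break
            data += chunk
        for sep in (b"\r\n\r\n", b"\n\n"):
            if sep in data:
                header, body_start = data.split(sep, 1)
                return header, body_start
        return data, b""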

The first line of an HTTP header is different for the request and the reply, and examples are shown in Sections 5 and 6 of the RFC.

The second and following lines of an HTTP header are formatted as follows: field: value, where field might be Content-length, Content-type, Accept, Etag, and so on. The value depends on the field, and is again encoded using ASCII. The value begins after the colon-blank sequence (the blank may be missing), and ends at the end of the line.
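
A sketch of this parsing in Python (field names are case-insensitive in HTTP, so they are lower-cased here -- that is an implementation choice, not a requirement):

    # Split the header lines after the first one into field/value pairs.
    def parse_header_fields(header_text):
        fields = {}
        for line in header_text.split("\n")[1:]:    # skip the first line
            line = line.rstrip("\r")
            if ":" in line:
                name, value = line.split(":", 1)
                fields[name.strip().lower()] = value.strip()
        return fields

    # parse_header_fields("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n")
    #   -> {"content-type": "text/html"}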

The length of the body of an HTTP response can be specified in one of two ways: with a Content-Length: line, or with a Transfer-Encoding: chunked, followed by one or more chunks, each preceded by its length in hex (the last chunk has length 0). Please see http://www.hawaii.edu/lis/programs/ for an example of such a response. Chunked responses are particularly useful when content is generated on demand at different times.
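
A sketch of de-chunking such a body in Python (chunk extensions and trailers, which the RFC allows, are simply ignored here):

    # Reassemble a body sent with Transfer-Encoding: chunked.  Each chunk
    # is "<length in hex>\r\n<data>\r\n"; a length of zero ends the body.
    def dechunk(body):
        out = b""
        while body:
            line, _, body = body.partition(b"\r\n")
            size = int(line.split(b";")[0], 16)   # hex length; drop extensions
            if size == 0:
                break
            out += body[:size]
            body = body[size + 2:]                # skip the data and its CRLF
        return out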

It is only because HTTP headers are ASCII that you can use nc to connect to the http port of a server. If you are confused by this, you can refer back to homework 1. The ability to use nc is a useful aid in debugging. You can also print headers that you send and receive, and cut and paste them into a nc session to test specific items (again, this is not true with HTTP/2). You can also use wireshark to inspect the HTTP headers -- and this should generalize to HTTP/2.

A further way to debug is to run your own web server, and use your client to connect to your server. You can run a trivial (non-responding) server with nc -l, or a full web server, depending on what you are testing. If you run nc in server mode, you can pipe its output to "cat -vet" or "od -t x1" to show non-printing characters either symbolically or in hex, respectively.

RFC 2616 says HTTP/1.1 systems should be able to handle HTTP/1.0 queries and responses. Your system need not ever handle HTTP/1.0 -- any behavior (including none at all) is acceptable when receiving an HTTP/1.0 response.

Some Details

It is relatively easy to build a simple program to get a single URL (specified on the command line), and you should definitely start with this.

After your program is able to get a given URL, you can start to modify your code to check the content type, search for the search string and print the number of occurrences (including 0), and parse any HTML that is received.

The next two steps are to do the full recursive search, and to correctly detect errors and print out meaningful messages.

The last step is to implement redirection correctly. http://www2.hawaii.edu/~esb is a redirected page, and you can use it for your test. You must be a little careful with redirection because of its potential to (a) reach a link that is not hierarchically below the root page, in which case you should not follow the link, and (b) reach a link that has been loaded before, in which case you must terminate this branch of the recursion.
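
Putting the two checks together, redirect handling might look like this Python sketch (is_below and should_request are the illustrative helpers sketched earlier; status and fields come from parsing the reply header):

    # Follow a 3xx redirect only if its target is hierarchically below
    # the root URL and has not been requested before.
    def next_url_after_redirect(root_url, status, fields):
        if status in (301, 302, 303, 307, 308) and "location" in fields:
            target = fields["location"]
            if is_below(root_url, target) and should_request(target):
                return target
        return None      # otherwise this branch of the search is complete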

Ideally your program would handle arbitrarily large web pages, arbitrarily long URLs, and an arbitrary number of web pages. For simplicity, you may if you wish restrict page sizes to 50,000 bytes, URLs to 500 bytes, and the number of web pages searched to a maximum of 1,000. Be sure you can detect when one of these limits is exceeded, and in that case gracefully terminate your program (make sure you print an appropriate error message).
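
If you adopt these limits, a small helper is enough to enforce them (a Python sketch; the constants simply mirror the numbers above):

    MAX_PAGE_BYTES = 50000    # maximum page size accepted
    MAX_URL_BYTES = 500       # maximum URL length accepted
    MAX_PAGES = 1000          # maximum number of pages requested

    def check_limit(value, limit, what):
        if value > limit:
            print("error:", what, "limit exceeded -- stopping")
            raise SystemExit(1)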

Implementing the above will give you full credit.

I encourage you to start early and, if possible, finish early -- this project may be simpler than the first two projects. If you do finish early, please feel free to turn in the project early -- I will accept a resubmission if you later find a bug that makes a significant difference.

If you want to extend your project once you have turned it in, you may extend it in several ways:

There may be similar programs available on the web. Needless to say, any code you turn in MUST be your own, and not taken or adapted from any other program (you are welcome to adapt any code from homeworks 1 and 2, or any of your other projects).

I suggest that you only test your web program on sites that are not likely to contain proprietary material, or to interpret your searches as signs of an impending attack. Sites hosted within .hawaii.edu may be particularly appropriate -- while testing, I suggest you load this web page, which should point any overzealous overseer in the right direction. If you want to be particularly considerate when accessing other domains, load the file robots.txt from the root of the domain, and follow its directives.

I do make mistakes. If you find mistakes in this project, please send mail to me or to the mailing list.

Timeline

If you begin working on November 14th, you have three weeks to finish the project. To make sure you stay on track, I would suggest the following goals:

Since the web is a big, complicated place, and there is no guarantee I will test your code on the same sites that you test it on, I encourage you to test your code on a variety of web sites and web servers, and to make sure it correctly reports all the errors it is required to report. This is a non-trivial task, but you can begin with the first prototype.