ICS 651 (Computer Networks) Project 3 -- Web search program

The goals of this project are:

  1. to learn how the world-wide web works
  2. to learn about simple web robots
  3. to become familiar with HTTP and HTML

This is an individual project. You may discuss ideas with your colleagues, but you must be the sole author of all your code.

I am running a web server at http://maru.ics.hawaii.edu/~esb/ for your convenience in testing. You are of course welcome to use other web servers.

You may use any reasonable programming language, as long as you do all the work required for the project. This means you are NOT allowed to use any special HTTP or HTML libraries, even though they may be provided with the language you select. Your program must establish a connection, generate and send the request, listen for, read, and parse the reply, and correctly handle any errors. Also, I must be able to test your code on a Linux box or on one of the department Suns. If you have any questions as to the suitability of a specific programming language, send me mail.
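
As an illustration of the steps named above (establish a connection, generate and send the request, read the reply), here is a minimal sketch in Python that uses only plain sockets. The function names split_url, build_request, and fetch are hypothetical, the 10-second timeout is an arbitrary choice, and the sketch assumes that a bare HTTP/1.0 request line with no header fields is enough for the server being tested.

```python
import socket

def split_url(url):
    """Split an http:// URL into (host, port, path) by hand."""
    prefix = "http://"
    if not url.lower().startswith(prefix):
        raise ValueError("unsupported URL: " + url)
    rest = url[len(prefix):]
    slash = rest.find("/")
    if slash < 0:
        hostport, path = rest, "/"      # no path given: ask for the root
    else:
        hostport, path = rest[:slash], rest[slash:]
    host, sep, port = hostport.partition(":")
    if not host:
        raise ValueError("missing host in URL: " + url)
    return host, (int(port) if sep else 80), path

def build_request(path, method="GET"):
    """Return the bytes of a minimal HTTP/1.0 request for path."""
    return (method + " " + path + " HTTP/1.0\r\n\r\n").encode("ascii")

def fetch(url, timeout=10.0):
    """Connect, send a GET, and return the raw reply bytes."""
    host, port, path = split_url(url)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:                # assumed: server closes when done
                break
            chunks.append(data)
    return b"".join(chunks)
```

Connection failures surface here as OSError and malformed URLs as ValueError, so a caller can catch both, print a message, and move on. Some servers that host several sites on one address may also expect a Host field in the request; that is not covered by this sketch.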

The project is due Tuesday, April 27th 1998, at 4pm HST. Submission is electronic:

  1. send e-mail to esb@hawaii.edu
  2. subject line must be "project 3 -- file-name", where file-name is the name of the source file in that email
  3. be sure to include your makefile in your submission
  4. no attachments please -- ASCII (text) files only
  5. your main file (file containing the "main" function) must show, in a comment near the beginning: whether you think your program works, and if not, what you think the problem is.

Late submissions lose 10% of the grade for each day they are late. Please submit what you have no later than 4pm April 27th. You may then submit improved versions after the deadline, and I will give you the highest grade earned by any of your submissions (after any late penalty).

Project Specification

Your job is to implement a program, websearch, which does the following:
  1. Parse its arguments, including: (in short, usage: websearch [-i] [-l] [-v] [-d=N] pattern URL [URL*])
  2. keep a set of URLs to search, each with a specific depth. The search order is arbitrary, i.e. you may pick any order that works for you. The URLs specified on the command line have depth zero.
  3. keep a current depth, initialized to zero
  4. while there are URLs to search, do the following:

The following are examples of the desired output format, which is inspired by fgrep(1):

% websearch -i biagioni http://maru.ics.hawaii.edu/~esb/index.html
http://maru.ics.hawaii.edu/~esb/index.html: <head><title>Edoardo S. Biagioni</title></head>
http://maru.ics.hawaii.edu/~esb/index.html: <h1>Edoardo S. Biagioni</h1>

% websearch -l -i -d=1 biagioni http://maru.ics.hawaii.edu/~esb/index.html
http://ancl.ics.hawaii.edu/
http://maru.ics.hawaii.edu/~esb/1999spring.ics651/index.html
http://maru.ics.hawaii.edu/~esb/1998fall.ics451/index.html
http://maru.ics.hawaii.edu/~esb/1998fall.ics312/index.html
...

% websearch biagioni http://maru.ics.hawaii.edu/~esb/index.html
(nothing printed out, since there is no occurrence of lowercase "biagioni"
in that web page)
%
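
The output format in the examples above can be sketched as follows. This is only an illustration: the function name matching_lines and its parameters are hypothetical, and it assumes, based on the examples and on fgrep(1), that the pattern is a fixed string, that -i makes the match case-insensitive, and that -l prints only the URL of a page containing a match.

```python
def matching_lines(url, text, pattern, ignore_case=False, list_only=False):
    """Return the output lines for one page, in fgrep-like format."""
    needle = pattern.lower() if ignore_case else pattern
    out = []
    for line in text.splitlines():
        haystack = line.lower() if ignore_case else line
        if needle in haystack:
            if list_only:
                return [url]            # assumed -l behavior: URL only, once
            out.append(url + ": " + line)
    return out

if __name__ == "__main__":
    url = "http://maru.ics.hawaii.edu/~esb/index.html"
    # Hypothetical page text containing the two lines from the example.
    page = ("<head><title>Edoardo S. Biagioni</title></head>\n"
            "<h1>Edoardo S. Biagioni</h1>\n")
    for line in matching_lines(url, page, "biagioni", ignore_case=True):
        print(line)
```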

Details

Of the HTTP/1.0 methods, you must implement GET. You may implement HEAD (to determine the type of a page). You do not need to implement POST.

You must handle every possible error by printing an appropriate error message and continuing with the next URL.
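
One way to structure "print an error and continue with the next URL" is sketched below, together with the per-URL depth bookkeeping. This is not a definitive design: the function crawl and its parameters are hypothetical, the error message format is invented, and it assumes that links found on a page at depth d get depth d+1 and are searched only while their depth does not exceed the -d=N limit. The fetch, extract_links, and handle_page callables stand in for code you would write yourself.

```python
import sys

def crawl(start_urls, max_depth, fetch, extract_links, handle_page):
    """Search each URL once, skipping over any URL that fails."""
    pending = [(url, 0) for url in start_urls]   # command-line URLs: depth 0
    seen = set(start_urls)
    while pending:
        url, depth = pending.pop()               # search order is arbitrary
        try:
            page = fetch(url)
        except (OSError, ValueError) as err:
            # Report the problem and continue with the next URL.
            print("websearch: %s: %s" % (url, err), file=sys.stderr)
            continue
        handle_page(url, page)
        if depth < max_depth:                    # assumed meaning of -d=N
            for link in extract_links(url, page):
                if link not in seen:
                    seen.add(link)
                    pending.append((link, depth + 1))
```

Keeping a set of URLs already seen prevents pages that link to each other from being fetched forever.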

You may get a reply beginning with HTTP/1.1 -- for example, you will get this from the server on maru. Treat this the same as you would a reply beginning with HTTP/1.0. You may ignore header fields you do not need, whether or not they are defined in HTTP/1.0, including specifically ETag, Accept-Ranges, Content-Length, Connection, Content-Type, and X-Pad. If you are interested in the meaning of these fields, feel free to study the HTTP/1.1 definitions.
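
A reply parser that treats HTTP/1.0 and HTTP/1.1 status lines identically and tolerates header fields it does not recognize might look like the sketch below. The function name parse_reply is hypothetical, and the sketch makes simplifying assumptions: headers are decoded as Latin-1, continuation (folded) header lines are not handled, and the body is returned as raw bytes.

```python
def parse_reply(raw):
    """Split a raw HTTP reply into (status_code, headers, body)."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        # Be lenient with servers that end lines with a bare LF.
        head, sep, body = raw.partition(b"\n\n")
    if not sep:
        raise ValueError("reply has no end of headers")
    lines = head.decode("latin-1").splitlines()
    if not lines:
        raise ValueError("empty reply")
    parts = lines[0].split(None, 2)      # version, status code, reason
    if (len(parts) < 2 or not parts[0].startswith("HTTP/")
            or not parts[1].isdigit()):
        raise ValueError("malformed status line: " + lines[0])
    headers = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if colon:
            # Unknown fields are stored but may simply be ignored later.
            headers[name.strip().lower()] = value.strip()
    return int(parts[1]), headers, body
```

Because only the "HTTP/" prefix of the version is checked, a reply starting with HTTP/1.1 goes through the same path as one starting with HTTP/1.0.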

If you wish to do additional work, you may try to implement any of the following:

Note that I do NOT plan to give extra credit. Get the basic program running first, and only add these features if everything else is done and you feel like improving your program.

I do make mistakes. If you find mistakes in the design of this project, please send mail to me or to the mailing list. I will probably reply within 24 hours (except on weekends).