ICS 651 (Computer Networks) Project 3 -- Web search program
The goals of this project are:
- to learn how the world-wide web works
- to learn about simple web robots
- to become familiar with HTTP and HTML
This is an individual project. You may discuss ideas with your
colleagues, but you must be the sole author of all your code.
I am running a web server at
http://maru.ics.hawaii.edu/~esb/ for your convenience in testing.
You are of course welcome to use other web servers.
You may use any reasonable programming language, as long as you do
all the work required for the project. This means you are NOT
allowed to use any special HTTP or HTML libraries, even though they
may be provided with the language you select. Your program must establish a
connection, generate and send the request, listen for, read, and parse
the reply, correctly handling any errors. Also, I must be able to
test your code on a Linux box or on one of the department Suns. If you
have any questions as to the suitability of a specific programming
language, send me mail.
The project is due Tuesday, April 27th 1999, at 4pm HST.
Submission is electronic:
- send e-mail to esb@hawaii.edu
- subject line must be "project 3 -- file-name": file-name is the
name of the source file in that email
- be sure to include your makefile in your submission
- no attachments please -- ASCII (text) files only
- your main file (file containing the "main" function) must show,
in a comment near the beginning: whether you think your program works,
and if not, what you think the problem is.
Late submissions lose 10% of the grade for each day they are late.
Please submit what you have no later than 4pm April 27th. You may
then submit improved versions after the deadline, and I will give
you the highest grade earned by any of your submissions, counting
the late penalty.
Project Specification
Your job is to implement a program, websearch, which
does the following:
- Parse its arguments, including:
  - optionally, -i for case-independent match of the
    pattern against the web data
  - optionally, -l to indicate the program should show only
    the matching URL(s), and not individual matching lines.
  - optionally, -v to only show non-matching lines (or URLs if -l
    is present)
  - optionally, -d=N (for some number N) to search up to
    depth N links. The value N defaults to 0 if this option is not given,
    that is, you only search the given URL.
  - required, a string to search for
  - required, one or more URLs to search through
  (in short, usage: websearch [-i] [-l] [-v] [-d=N] pattern URL [URL*])
- keep a set of URLs to search, each with a specific depth. The
search order is arbitrary, i.e. you may pick any order that works for
you. The URLs specified on the command line have depth zero.
- keep a current depth, initialized to zero
- while there are URLs to search, do the following:
  - pick a URL from the set, and remove it so it is not accessed again
  - fetch the data from the server, using HTTP/1.0 as described in RFC 1945
  - check the content type, and discard this URL unless the content
    type is text/html or text/plain. Also discard any URL that
    returns data with a non-empty Content-Encoding. If an error
    occurs, print an appropriate message and discard the URL. Whenever
    you discard a URL, start over with the next one.
  - search the retrieved data for occurrences of the pattern, and
    print either the URL (if -l has been specified) or the
    matching lines (with -v, the non-matching ones instead)
  - also search for links (identified by 'href='). The depth of
    these links is one more than the current depth. If a link is an
    http link and the current depth is less than the maximum depth,
    add the new URL to your set of URLs to search.
    Notice that the links might be relative to the current URL. If
    so, you need to convert them into absolute URLs before adding
    them to your set.
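As an illustration only, and not a required design, the search loop above
might be sketched in Python as follows. The names extract_hrefs, resolve,
and crawl are hypothetical, as are the fetch and visit callbacks that stand
in for the HTTP and matching code. The sketch assumes every base URL starts
with http://, does not treat query strings specially, and does not recognize
loops, so a page may be searched more than once.

```python
import re
from collections import deque

# Matches href="x", href='x' and unquoted href=x, in any letter case.
HREF = re.compile(r'''href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))''', re.IGNORECASE)
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")

def extract_hrefs(text):
    """Return the non-empty href values found in text, in order."""
    values = (m.group(1) or m.group(2) or m.group(3) for m in HREF.finditer(text))
    return [v for v in values if v]

def resolve(base, link):
    """Return link as an absolute http URL, or None if it is not an http link."""
    link = link.split("#", 1)[0]              # a fragment names a spot inside a page
    if not link:
        return None
    if link.lower().startswith("http://"):
        return link                           # already absolute
    if SCHEME.match(link):
        return None                           # mailto:, ftp:, https:, ...
    if link.startswith("//"):
        return "http:" + link                 # same scheme, different host
    host, slash, path = base[len("http://"):].partition("/")
    if link.startswith("/"):
        return "http://" + host + link        # relative to the server root
    # Otherwise the link is relative to the directory of the base URL.
    segments = [s for s in (slash + path).rsplit("/", 1)[0].split("/") if s]
    parts = link.split("/")
    if parts[-1] in (".", ".."):
        parts.append("")                      # "x/.." names a directory
    for part in parts[:-1]:
        if part == "..":
            if segments:
                segments.pop()
        elif part not in (".", ""):
            segments.append(part)
    return "http://" + host + "/" + "/".join(segments + [parts[-1]])

def crawl(start_urls, max_depth, fetch, visit):
    """Search URLs up to max_depth; fetch(url) returns text, or None to discard."""
    todo = deque((url, 0) for url in start_urls)   # command-line URLs have depth 0
    while todo:
        url, depth = todo.popleft()                # pick a URL and remove it
        text = fetch(url)
        if text is None:                           # error or unwanted content type
            continue
        visit(url, text)                           # pattern matching happens here
        if depth >= max_depth:
            continue                               # do not follow links any deeper
        for link in extract_hrefs(text):
            absolute = resolve(url, link)
            if absolute is not None:
                todo.append((absolute, depth + 1))
```

The sketch uses a queue, which gives a breadth-first order, but since the
search order is arbitrary a stack or any other container would do as well.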
The following are examples of the desired output format, which
is inspired by fgrep(1):
% websearch -i biagioni http://maru.ics.hawaii.edu/~esb/index.html
http://maru.ics.hawaii.edu/~esb/index.html: <head><title>Edoardo S. Biagioni</title></head>
http://maru.ics.hawaii.edu/~esb/index.html: <h1>Edoardo S. Biagioni</h1>
% websearch -l -i -d=1 biagioni http://maru.ics.hawaii.edu/~esb/index.html
http://ancl.ics.hawaii.edu/
http://maru.ics.hawaii.edu/~esb/1999spring.ics651/index.html
http://maru.ics.hawaii.edu/~esb/1998fall.ics451/index.html
http://maru.ics.hawaii.edu/~esb/1998fall.ics312/index.html
...
% websearch biagioni http://maru.ics.hawaii.edu/~esb/index.html
(nothing printed out, since there is no occurrence of lowercase "biagioni"
in that web page)
%
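The output format shown above might be produced by a function along the
following lines. This is a minimal sketch: search_page and its parameter
names are hypothetical, and it assumes that -l combined with -v means
"print the URLs of pages in which no line matches", which is only one
possible reading of the option description.

```python
def search_page(url, text, pattern, ignore_case=False, list_only=False, invert=False):
    """Return the lines that would be printed for one retrieved page."""
    needle = pattern.lower() if ignore_case else pattern

    def matches(line):
        haystack = line.lower() if ignore_case else line
        return needle in haystack               # plain substring match, as in fgrep

    lines = text.splitlines()
    if list_only:
        # Assumption: with -l, -v selects pages that contain no match at all.
        page_matches = any(matches(line) for line in lines)
        return [url] if page_matches != invert else []
    # Without -l, print "URL: line" for each selected line.
    return [url + ": " + line for line in lines if matches(line) != invert]
```

For example, calling search_page on the two lines shown in the first example
with pattern "biagioni" returns both lines when ignore_case is true, and an
empty list otherwise.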
Details
Of the HTTP/1.0 methods, you need to implement GET,
you may implement HEAD (to figure out the type of the page),
and you do not need POST.
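A GET exchange over a plain TCP socket might look like the sketch below.
The function names split_url, build_request, and fetch_raw are hypothetical.
The Host header is not part of HTTP/1.0; it is included here on the
assumption that the server hosts several sites and needs it, and it can be
dropped otherwise. The sketch relies on the HTTP/1.0 behavior that the
server closes the connection once the reply is complete.

```python
import socket

def split_url(url):
    """Split http://host[:port]/path into (host, port, path)."""
    if not url.lower().startswith("http://"):
        raise ValueError("not an http URL: " + url)
    hostport, slash, path = url[len("http://"):].partition("/")
    host, colon, port = hostport.partition(":")
    if not host:
        raise ValueError("missing host in URL: " + url)
    return host, (int(port) if colon else 80), "/" + path

def build_request(host, path):
    """Return the bytes of an HTTP/1.0 GET request for path."""
    lines = [
        "GET %s HTTP/1.0" % path,     # request line
        "Host: %s" % host,            # assumption: optional, not in HTTP/1.0
        "",                           # the empty line ends the header
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

def fetch_raw(url, timeout=10.0):
    """Send a GET request and return the complete raw reply as bytes."""
    host, port, path = split_url(url)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:              # the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```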
You do need to appropriately handle all possible errors by printing
an appropriate error message and continuing with the next URL.
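One way to keep going after an error is to wrap each fetch so that a failure
becomes a message plus a "discard this URL" result. The sketch below is only
an illustration: try_fetch, its fetch and report parameters, and the message
format are all hypothetical choices.

```python
import sys

def try_fetch(url, fetch, report=None):
    """Call fetch(url); on failure report the error and return None."""
    try:
        return fetch(url)
    except (OSError, ValueError) as err:
        # OSError covers socket failures such as refused connections,
        # unknown hosts, and timeouts; ValueError covers malformed input.
        message = "websearch: %s: %s" % (url, err)
        if report is None:
            print(message, file=sys.stderr)
        else:
            report(message)
        return None
```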
You may get a reply beginning with HTTP/1.1 -- for
example, you will get this from the server on maru. Treat this the
same as you would a reply beginning with HTTP/1.0. You may
ignore fields that are not defined in HTTP/1.0, such as
ETag, Accept-Ranges, Connection, and X-Pad. You may also ignore
Content-Length, but not Content-Type or Content-Encoding, which you
need in order to decide whether to discard a URL. If
you are interested in the meaning of these fields, feel free to study
the HTTP/1.1 definitions.
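Reading the reply might be sketched as follows. The names parse_response and
is_searchable are hypothetical. The sketch assumes that any status other
than 200 means the URL is discarded, accepts both HTTP/1.0 and HTTP/1.1
status lines, ignores folded (continuation) header lines, and simply carries
along any header fields it does not use.

```python
def parse_response(raw):
    """Split a raw HTTP reply (bytes) into (status, headers, body)."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        head, sep, body = raw.partition(b"\n\n")     # tolerate bare LF line ends
    lines = head.decode("iso-8859-1").replace("\r\n", "\n").split("\n")
    parts = lines[0].split(None, 2)                  # version, code, reason
    if len(parts) < 2 or not parts[0].startswith("HTTP/1.") or not parts[1].isdigit():
        raise ValueError("bad status line: " + lines[0])
    headers = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if colon:
            headers[name.strip().lower()] = value.strip()   # names are case-insensitive
    return int(parts[1]), headers, body

def is_searchable(status, headers):
    """Return True if the reply should be searched rather than discarded."""
    if status != 200:                                # assumption: only 200 is searched
        return False
    if headers.get("content-encoding", ""):
        return False                                 # non-empty Content-Encoding
    media = headers.get("content-type", "").split(";", 1)[0].strip().lower()
    return media in ("text/html", "text/plain")
```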
If you wish to do additional work, you may try to implement any
of the following:
- recognition of loops, so if page A points to page B and page
B points to page A, each page is only searched and printed once
- a graphical display of the "tree" being searched, perhaps
differentiating local from remote links
- an interaction window allowing the user to dynamically control
the search
- parsing and interpreting of the HTML control constructs, so you
can display the results in the appropriate font or (optionally) ignore
comments
- full HTTP/1.1, perhaps including the use of persistent connections
for better performance.
- caching or indexing of pages for later use (a web proxy does
caching; a regular web search engine does indexing)
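For the first item in the list above, recognition of loops, one simple
approach is to remember every URL already searched. The sketch below is an
assumption about how that might be done, not a required design: normalize
and first_visit are hypothetical names, the input is assumed to be an
absolute URL, and the only variations treated as equal are the case of the
scheme and host, an explicit port 80, and a trailing fragment.

```python
def normalize(url):
    """Return a canonical form of an absolute URL for the 'seen' check."""
    url = url.split("#", 1)[0]                 # fragments do not name new pages
    scheme, sep, rest = url.partition("://")
    host, slash, path = rest.partition("/")
    host = host.lower()                        # host names are case-insensitive
    if host.endswith(":80"):
        host = host[:-3]                       # 80 is the default http port
    return scheme.lower() + "://" + host + "/" + path

def first_visit(url, seen):
    """Return True and record url if it has not been searched before."""
    key = normalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

A search loop would call first_visit with one shared set before fetching
each URL, and skip the URL when the result is false.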
Note that I do NOT plan to give extra credit. Get the basic
program running first, and only add these features if everything
else is done and you feel like improving your program.
I do make mistakes. If you find mistakes in the design of this
project, please send mail to me or to the mailing list. I will
probably reply within 24 hours (except on weekends).