Web Crawling

X.C. Created on: 12/23/2007, Last updated: 12/31/2007

Basic tasks

There are many reasons for crawling the web: search engine indexing, web mining, automatic web information retrieval.

Four basic tasks and their concerns seem to be: 1) download (choice of language, whether to use multithreading), 2) parse (use library, or personalized parse), 3) storage (on disk, or database), 4) analysis (e.g., data mining, clustering, classification).

Parsing html

Paring of html can be complicated because most browsers don't strictly enforce html format standards. Non-standard html thus populate the entire WWW. This irregularity makes it hard to parse html pages in the real world.

On the other hand, html parsing is simplified by the fact that tags quoted by "<" and ">" always have higher priority than " and ' quoted strings. If a "<" is followed by some alphabet then it is always treated as a tag, no matter it's inside or outside "" or ''.

Html tags can be classified into several types for parsing purpose: 1) single tags. e.g., <img>, <br>. 2) double tags. e.g., <p>..</p> (which is not strictly required), <div>..</div>... These can be further classified into those allow/not allow nesting. 3) combination tags, e.g., <table><tr><td></td></tr></table>. This may be the most complex to parse. Problems like nested table, improperly nested structs, redundant or missing tags make it particularly frustrating to parse a table.

Perl crawler

Perl seems to have strong library support in downloading and parsing the web. A perl script to download site is here. This was tested in DOS on windows XP, Perl version 5.8.8.

C# crawler

I tried C# on web crawling. It uses networking, database, multithreading, parsing, email notification etc. together.

It seems that there is no popular html parser library for C#. A fast search on google found these html parsers. Several methods that are often employed are: 1) use regular expression. But it's not powerful enough in many cases. 2) parse html as xml. But since most html out in the real world are far from standard xhtml, parsing by xml parser can be easily broken. 3) return a flat array of tags. This method could be ok, although quite some additional efforts may be needed to process the flat array of tags. Besides, extracting texts quoted by paired tags can be a problem if there are redundant, missing or improperly nested tags involved. 4) ad hoc methods. These can be quite rigid methods, in which the user has to set a pre-defined template of the fields they want. Crawling each new page needs a new template.

I believe a better solution should be based on formal parsing technique. These are quite heavy for general purposes though. Many people just do casual work on retrieving desired piece of digest, they don't have the motivation to write a full parser while they indeed complain about the lack of one. It is not so easy either.

I believe a light-weighted library based on methods like recursive descent is possibly a good solution. We don't need to parse the entire html text for interpreting or compiling purposes, we just want to be able to easily extract the information we need. Using formal parsing technique like recursive descent is much more powerful than regular expression. Besides, recursive descent is easy to deploy by hand.

I already have a C# html table parser implemented using the recursive descent method. The C# class, including all the helper methods and comments, is about 400 lines in length.

Hyperlinks is another important class of tags. Its parsing should be easier than tables.

Notes on web crawling readings

  1. Wiki - Web_crawler [3]
  2. SearchEngine overview and Architecture of a typical search engine [4]
  3. Optimizations

References

  1. WWW Consortium: http://www.w3.org/
  2. HTML 4.01 specification: http://www.w3.org/TR/html401/
  3. http://en.wikipedia.org/wiki/Web_crawler
  4. Ji-Rong Wen, WSM, MSRA, (2006) SearchEngine Overview:
    http://www.searchforum.org.cn/seminar/lectures/2006-9-25-JirongWen-Search%20Engine%20Overview.PDF
  5. S. Brin, L. Page. (1998) The anatomy of a large-scale hypertextual web search engine. http://infolab.stanford.edu/~backrub/google.html
  6. Google bot: http://www.google.com/bot.html
  7. Search engine marketing tips and search engine news: http://searchenginewatch.com/
  8. Carlos Castillo. PhD thesis: Web Crawling. (2004) http://www.chato.cl/research/crawling_thesis