A TAXONOMY OF INTERNET SEARCH ENGINES

by

Richard E. Peterson

Professor, Financial Economics & Institutions

College of Business Administration

University of Hawaii

Honolulu, HI 96822

(808) 956-7563

FAX: (808) 956-9887

E-mail: peterso@fei.cba.hawaii.edu

April 1996

KEY WORDS

home page, bookmark page, communication highway, types of search engines, primary search engine, niche search engine, small fraction search engine, specialized segment search engine, subject directory, mega index, simultaneous mega index.

ABSTRACT

Home pages and bookmark pages are a form of publication for a professor communicating with students and the academic world. A published bookmark page is like an electronic public library which is always open to the Internet traveler.

Finding information is another way by which information itself gets published. This is the forte of the Internet search engine. A search engine classification scheme is proposed, consisting of four types: primary, niche, mega index, and simultaneous mega index search engines.

The top three primary search engines are identified: Alta Vista, Lycos, and Open Text. The specialized search engines can be expected to play an important role as the Internet continues its rapid growth.

Harvesting Information From the Internet Using Search Engines

Spiders, worms, Web crawlers, and Web ants are all names for robots. Robots are computer programs that travel the Internet, visiting sites, retrieving documents, and then retrieving all documents mentioned in those retrieved documents. The process continues until every document that is cited by another document has been discovered. The retrieved documents are gathered into an indexed database. A search engine is a program that searches such a database for the keywords or phrases in a query.
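The harvesting process just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Web is modeled as a small in-memory dictionary whose page names, texts, and links are invented, no network access takes place, and the function names crawl, build_index, and search are hypothetical rather than taken from any actual search engine.

```python
from collections import deque

# Hypothetical miniature "Web": each page has some text and a list of links.
WEB = {
    "home": {"text": "welcome to my home page", "links": ["bookmarks", "paper"]},
    "bookmarks": {"text": "bookmark page of search engines", "links": ["paper"]},
    "paper": {"text": "a taxonomy of search engines", "links": ["home"]},
    "orphan": {"text": "no page links here", "links": []},
}

def crawl(start):
    """Visit the start page, then every page cited by a visited page."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in WEB[page]["links"]:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = {}
    for page in pages:
        for word in WEB[page]["text"].lower().split():
            index.setdefault(word, set()).add(page)
    return index

def search(index, query):
    """Return the pages containing every keyword in the query, sorted by name."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

pages = crawl("home")
index = build_index(pages)
print(search(index, "search engines"))
```

In this toy model the page named "orphan" is never discovered, because no other page cites it. A robot that only follows links cannot find such a page unless it is told where to start.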

The World Wide Web


A major and fastest-growing portion of the Internet is the World Wide Web (WWW), also known as Cyberspace. The WWW consists of sites which contain documents written in the HyperText Markup Language (HTML). Within these documents there are underlined or colorfully highlighted phrases which are "hot links". Point your computer mouse at one, click, and you will be taken to a new location, which could be continents away. Internet sites are commonly called "home pages", and each has an Internet address known as a "Uniform Resource Locator" (URL). My home page, for example, has a URL of http://www2.hawaii.edu/~rpeterso


My Home Page


Anyone in the world who is connected to the Internet can access my home page. Although it is still under construction, it does have six major items.

(1) A draft of my article "Internet Search Engines," which provides hot links to 19 search engines and concludes that the top three search engines are Alta Vista, Lycos, and Open Text.
Article on "Internet Search Engines"

(2) Class material for a core course I teach: BEc 353 Macroeconomics in the World Economy. When completed, it will contain the course syllabus and a vocabulary list with definitions for the important terms in the textbook chapters. Since all our students have access to the Internet, they can obtain these materials without the need for me to distribute printed copies.
Chapter 1 Vocabulary

(3) A collection of 700 bookmarks spanning 45 categories. Each bookmark is a hot link and is just a mouse click away from the site involved. Some of the categories are art, Buddhism, business/economics, Internet, HTML, home pages, government resources, and just plain "fun sites."
BOOKMARK PAGE

Metaphorically speaking, my bookmark page is a library in my home with 45 bookcases and 700 books, any of which can be delivered robotically to you in your arm chair for your perusal. Even more miraculously, with each book, there are references or hot links to many other books located almost anywhere in the world, accessible from this same arm chair.

(4) A form for sending e-mail comments to me.
rpeterso@hawaii.edu

(5) A copy of the present paper with hot links to the search engines described herein.
Harvesting Information From the Internet Using Search Engines

(6) A Bookmark Collection of Fifty Search Engines
Fifty Search Engines


Finding information


The Internet search engine is the engine of a vehicle, or taxicab, that delivers information along the Information Superhighway. The request for information is made via a keyword query to the search engine.

I recently needed to find a copy of the American Psychological Association (APA) guidelines for writing scholarly papers. I submitted the phrase "APA guidelines" to the Alta Vista search engine. After 20 or 30 seconds, the guidelines in summarized form appeared as a clickable link and, a minute or two later, I had the printed copy.


Communication


Communication is, in part, access to information, and it is also publication of information. The Information Highway is really a Communication Highway. Both the "access to" and the "publication of" have been revolutionized. Access is virtually instantaneous and is obtainable from search engines. Publication, in a nominal sense, occurs when your Web page is connected to the Internet, and occurs in a real sense when the search engines discover your site or you notify the search engines of your existence.

Although the computer revolutionized computation, the Internet as cyberspace will revolutionize communication. As indicated by The Economist (pp. 13-14, March 23, 1996), "You do not have to be a nerd or a mystic to see that historians will look back upon the emergence of 'cyberspace' as a turning point no less decisive than the advent of the computer itself." Whereas the computer does computation, the Internet does communication.

The Variety of Search Engines


There are literally hundreds of search engines, so it is hard to know where to start. Although the top three mentioned earlier are like a council of towering redwoods at the entrance of a forest, the other trees can each, in their own way, provide value-added services. As an initial simplification, there are only four types of search engine, as follows.

Primary Search Engines


Their robots have accessed a significant portion of the Internet and the resulting database is keyword searchable. Thus defined, there are six primary search engines:

Alta Vista
http://www.altavista.com/
Excite
http://www.excite.com/
InfoSeek
http://www2.infoseek.com/
Lycos
http://www.lycos.com/
NlightN
http://www.nlightn.com/
Open Text
http://www.opentext.com:8080/

Niche Search Engines


These search engines have gatherers which can be either human or robotic. They collect information to form databases of a small fraction (under ten percent) or a specialized segment of the Internet. Examples of the small fraction search engine are

Harvest Broker
http://harvest.cs.colorado.edu/
Magellan
http://www.webcrawler.com/
WebCrawler
http://webcrawler.com/
WWW Worm
http://wwww.cs.colorado.edu/wwww

The specialized segment database, known also as a subject directory, comes in two varieties: browsable and searchable. A subject directory is a search engine if it is searchable. A browsable subject directory, such as the Bookmark page on my Web page, is not a search engine. The most famous of all subject directories, Yahoo, is searchable as well as browsable.

When you "do a Yahoo," you use Yahoo to look up a keyword phrase. Yahoo first searches its 21-category subject directory and suggests sites for you to visit; it then uses the primary search engine, Open Text, to search the portion of the Internet covered by its database.
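The two-stage lookup just described might be sketched as follows. This is a minimal illustration, not Yahoo's actual implementation: the directory entries are invented, full_text_search is a stand-in for the hand-off to a primary search engine, and every function name is hypothetical.

```python
# Hypothetical subject directory: category -> list of site titles.
DIRECTORY = {
    "Art": ["Louvre Museum", "Art History Resources"],
    "Business and Economy": ["Economic Indicators", "Stock Quotes"],
}

def search_directory(query):
    """Return (category, site) pairs whose category or title mentions the query."""
    q = query.lower()
    return [
        (category, site)
        for category, sites in DIRECTORY.items()
        for site in sites
        if q in category.lower() or q in site.lower()
    ]

def full_text_search(query):
    """Stand-in for the primary search engine that the directory hands off to."""
    return [f"full-text result for '{query}'"]

def lookup(query):
    """Directory suggestions first, then the wider full-text results."""
    return {
        "directory": search_directory(query),
        "full_text": full_text_search(query),
    }

result = lookup("art")
print(result["directory"])
```

The directory stage matches only the hand-built categories and titles, so it returns few but well-organized suggestions. The second stage reaches the much larger full-text database.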

There are over a hundred specialized segment search engines. Here are a few:

CNN Search
Has current CNN news and archived news stories.
http://www.cnn.com/
Point Top 5%
Database consists of the top 5% of Web sites.
http://www.pointcom.com/
Shareware.com
A new service from c|net inc. Lets you browse by most popular, by new arrivals, or by subject. Has a searchable database of 160,000 software files.
http://www.shareware.com/
Veronica
A search engine for gopherspace, a text-only segment of the Internet.
gopher://munin.ub2.lu.se/11/resources/veronica

Mega Index Search Engines


These do not have robots that scurry to and fro picking up sites. They do, however, provide access to primary search engines, niche search engines, and even other mega index search engines. They attempt to provide an all-in-one location for accessing information on the Internet. There are over a hundred; here is a sampling.

Bobaworld
http://gagme.wwa.com/~boba/search.html
NetPad
http://eshnav.com/netpad/

Bobaworld accesses each of the six primary search engines and NetPad accesses five of them. NetPad accesses a number of additional sites:

CIA World Factbook
Library of Congress
Maps in the news

Simultaneous Mega Index Search Engines


These access most of the primary search engines all at once. There are just two:

Meta Crawler
http://metacrawler.cs.washington.edu:8080/
Savvy Search
http://www.cs.colostate.edu/~dreiling/smartform.html
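The idea of querying several primary search engines at once can be sketched as below. The three engine functions are hypothetical stubs that return fixed lists, and merging the answers by counting how many engines reported each site is an assumption made for illustration; it is not a description of how Meta Crawler or Savvy Search combine their results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for primary search engines; each returns a ranked list.
def engine_a(query):
    return ["site1", "site2", "site3"]

def engine_b(query):
    return ["site2", "site4"]

def engine_c(query):
    return ["site3", "site2"]

ENGINES = [engine_a, engine_b, engine_c]

def simultaneous_search(query):
    """Send the query to every engine at once and merge the answers."""
    with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), ENGINES))
    # Count how many engines returned each site, in order of first appearance.
    votes = {}
    for results in result_lists:
        for site in results:
            votes[site] = votes.get(site, 0) + 1
    first_seen = {site: i for i, site in enumerate(votes)}
    # Sites reported by more engines come first; ties keep first-seen order.
    return sorted(votes, key=lambda s: (-votes[s], first_seen[s]))

print(simultaneous_search("search engines"))
```

Because the queries run concurrently, the total wait is governed by the slowest engine rather than by the sum of all the engines' response times.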

Conclusion


The information seeker will have a daunting task if there are hundreds of search engines and if all search engines are equal. But they are not all equal.

It is reassuring to know that there are only six search engines that cover a significant part of the Internet. In fact, the Top Three (Alta Vista, Lycos, and Open Text) will between them bring most searches to a successful result.

There are, however, important segments of the Internet that are currently best covered by specialized search engines. If you are lucky, the primary search engine will point you to these specialized databases. In a sense, this only postpones the problem, since you are likely to get 200 or so returns for your query, and these will take hours to explore. A general knowledge of what is available from the niche search engines, then, should be included in the information seeker's repertoire.