HARVESTING INFORMATION FROM THE INTERNET USING SEARCH ENGINES

by

Richard E. Peterson

Professor, Financial Economics & Institutions

College of Business Administration

University of Hawaii

Honolulu, HI 96822

Phone: (808) 956-7563

FAX: (808) 956-9887

E-mail: peterso@fei.cba.hawaii.edu

March 1997

ABSTRACT

      This paper compares and contrasts Internet search engines in order to classify them. The research issues include: (1) whether some search engines are bigger than others in terms of Internet coverage; (2) whether some search engines that cover just a small part of the Internet are nevertheless very useful; (3) whether some "search engines" are not search engines at all but merely links to them; and (4) whether some search engines are more like a table of contents than an index. A four-fold classification scheme is proposed as a possible answer to these questions.

PREFACE

      Internet search tools, also known as search or query engines because they help you travel the "information superhighway," are quite numerous. Judith Bernstein (1996) estimates there are over 200 search sites available. The title of her article, "Finding a Needle in a Digital Haystack," suggests that finding what you want is going to be difficult because there is so much material on the Internet.

      According to Neubarth (1996), Internet search engines date back to 1989. Prior to 1991, Internet users had to use arcane UNIX commands to search the Net. In 1991, however, the University of Minnesota created a more user-friendly text interface called Gopherspace, and search engines with names from the Archie comic strip -- Veronica and Jughead -- soon appeared. The invention of the World Wide Web in 1991 by Tim Berners-Lee enabled the emergence of graphical tools which made navigation much easier for the typical net "surfer."[1]

      Instead of surfing, you can "browse" the Internet going from site to site and, if something of interest is found, save its address in what is called a bookmark. When you reopen the book, or revisit the Net in this case, you can revisit the bookmarked site without retyping its Uniform Resource Locator (URL) address. Before long, you become overwhelmed by dozens if not hundreds of bookmarked sites, many of which you will not recall. In some ways, to browse is to wander aimlessly.

      There are, however, software programs which will do the browsing, not only for you, but also for all Net surfers. The software wanderers -- also known as robots, spiders, worms, and web crawlers -- travel the Web, visit sites, and retrieve documents. Their gatherings are combined into an indexed database. A search engine is a program that searches such a database for the keywords or phrases in a user's query.
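      The idea of an indexed database that is keyword searchable can be sketched in a few lines of Python. This is a minimal illustration only, not a description of how any of the engines discussed here is built: the page addresses and texts are invented, and the names build_index and search are hypothetical.

```python
from collections import defaultdict

# Hypothetical pages a robot might have retrieved (address -> text).
PAGES = {
    "http://example.org/a": "Surfing the Internet with search engines",
    "http://example.org/b": "Gopher menus and Veronica searches",
    "http://example.org/c": "Search engines index the Web",
}

def build_index(pages):
    """Map each lowercased word to the set of addresses containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the sorted addresses that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)

INDEX = build_index(PAGES)
print(search(INDEX, "search engines"))
```

      The robot's work is done once, when the index is built; each query is then a lookup in the index rather than a fresh trip across the Web.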

THE VARIETY OF SEARCH ENGINES

      My initial exploration consisted of visiting the most frequently cited search engines. Forty-nine were identified and a table was constructed with 49 rows and 49 columns.[2] For a given search engine, it could be observed whether it was linked by, and whether it was linked to, other search engines. It was the search engines that did not have links to other search engines that became especially important. In the classification that follows, they are referred to as the "primary search engines."
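      The table can be pictured with a short Python sketch. The four engines and the links among them are assumed for illustration only (the actual table had 49 rows and 49 columns, and its entries are not reproduced here); the function names are hypothetical.

```python
# Illustrative engines and outgoing links; assumed data, not the real table.
ENGINES = ["Alta Vista", "Lycos", "Search.Com", "Yahoo"]
LINKS_TO = {
    "Alta Vista": set(),
    "Lycos": set(),
    "Search.Com": {"Alta Vista", "Lycos", "Yahoo"},
    "Yahoo": {"Alta Vista"},
}

def link_table(engines, links_to):
    """Build the square table: table[a][b] is True when a links to b."""
    return {a: {b: b in links_to.get(a, set()) for b in engines}
            for a in engines}

def no_outgoing_links(table):
    """Engines whose row is all False, i.e. that link to no other engine."""
    return [a for a, row in table.items() if not any(row.values())]

def linked_by(table, engine):
    """Engines whose row has True in the given engine's column."""
    return [a for a, row in table.items() if row[engine]]

TABLE = link_table(ENGINES, LINKS_TO)
print(no_outgoing_links(TABLE))
print(linked_by(TABLE, "Alta Vista"))
```

      Reading across a row shows which engines a site links to; reading down a column shows which engines link to it. An empty row is only the first filter for the primary category, which also depends on how much of the Internet the engine's own database covers.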

(1) PRIMARY SEARCH ENGINES

      The robots of these search engines have accessed a significant portion of the Internet, and the resulting indexed database is keyword searchable.[3] Thus defined, there are eight primary search engines:

*Alta Vista at http://www.altavista.digital.com/

*Excite at http://www.excite.com/

*HotBot at http://www.hotbot.com/

*InfoSeek Guide at http://guide.infoseek.com/

*Lycos at http://www.lycos.com/

*Open Text at http://search.opentext.com/

*Ultra at http://ultra.infoseek.com/

*WebCrawler at http://webcrawler.com/

(2) NICHE SEARCH ENGINES

      These search engines have gatherers which can be either human or robotic. They collect information to form databases of a small fraction (under ten percent) or a specialized segment of the Internet. Examples of the small-fraction search engine are:

*Harvest Broker at http://harvest.cs.colorado.edu/

*Magellan at http://www.mckinley.com/

*WWW Worm at http://wwww.cs.colorado.edu/wwww

*Yahoo at http://www.yahoo.com/

      The specialized segment database, known also as a subject directory, comes in two varieties: browsable and searchable.[4] A subject directory is a search engine if it is searchable. A browsable subject directory, such as the Bookmark page on my Web page [5], is not a search engine. The most famous of all subject directories, Yahoo, is searchable as well as browsable.

      When you use Yahoo to look up a keyword phrase, it first searches its 66-category subject directory and suggests sites for you to visit; it then uses the primary search engine, Alta Vista, to search the portion of the Internet covered by Alta Vista's database.
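      The two-stage lookup just described can be sketched as follows. This is an illustration under stated assumptions, not Yahoo's actual implementation: the two-category directory, the site addresses, and the function names are all hypothetical, and the primary engine is replaced by a placeholder that returns a made-up address.

```python
# Hypothetical stand-ins: a tiny category directory and a primary engine.
DIRECTORY = {
    "Business and Economy": ["http://example.org/finance"],
    "Computers and Internet": ["http://example.org/search-tools"],
}

def search_directory(phrase):
    """Return sites filed under categories whose name contains the phrase."""
    phrase = phrase.lower()
    return [site
            for category, sites in DIRECTORY.items()
            if phrase in category.lower()
            for site in sites]

def search_primary_engine(phrase):
    """Placeholder for a query forwarded to a primary search engine."""
    return ["http://example.org/page-about-" + phrase.lower().replace(" ", "-")]

def two_stage_lookup(phrase):
    """Directory suggestions first, then the primary engine's returns."""
    return {
        "directory": search_directory(phrase),
        "primary": search_primary_engine(phrase),
    }

RESULT = two_stage_lookup("Internet")
print(RESULT)
```

      Keeping the two result sets separate mirrors the description above: the directory suggestions come from the small, human-edited database, while the second set comes from the much larger database of the primary search engine.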

      There are over a hundred specialized segment search engines. Here are a few:

*Christian Science Monitor at http://www.csmonitor.com/
A free Web-based newspaper with searchable archives dating back to 1980.

*Point Top 5% at http://www.pointcom.com/
Database consists of the top 5% of Web sites.

*Shareware.com at http://www.shareware.com/
A new service from c|net inc. Enables you to browse by most popular, by new arrivals, or by subject. Has a searchable database of 160,000 software files.

*Veronica at gopher://munin.ub2.lu.se/11/resources/veronica
A search engine for Gopherspace, a text-only segment of the Internet.

(3) MEGA INDEX SEARCH ENGINES

      These do not have robots that traverse the Internet. They do, however, provide access to primary search engines, niche search engines, and even other mega index search engines. They attempt to provide an all-in-one location for accessing information on the Internet. There are over a hundred; here are two that link to all eight primary search engines:

*Hot Links to Big Eight Search Engines at http://www2.hawaii.edu/~rpeterso/bigeight.htm

*Search.Com at http://www.search.com/

(4) SIMULTANEOUS MEGA INDEX SEARCH ENGINES

      Also known as multithreaded or parallel query engines, they access most of the primary search engines all at once. There are only a few of them (Meta-Search Engines, 1997); here is a sampling:

*All-4-One at http://www.all4one.com/

*Meta Crawler at http://metacrawler.cs.washington.edu:8080/

*Savvy Search at http://www.cs.colostate.edu/~dreiling/smartform.html
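      What "all at once" means can be sketched with Python's standard thread pool. The three engine functions below are hypothetical stand-ins that return canned addresses; a real parallel query engine would send the query over the network to each primary search engine, and nothing here describes how the sites listed above are actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for primary search engines; each returns canned URLs.
def engine_a(query):
    return ["http://example.org/1", "http://example.org/2"]

def engine_b(query):
    return ["http://example.org/2", "http://example.org/3"]

def engine_c(query):
    return ["http://example.org/3", "http://example.org/4"]

STUB_ENGINES = {"A": engine_a, "B": engine_b, "C": engine_c}

def simultaneous_search(query, engines):
    """Send the query to every engine at once and merge the returns.

    Duplicate addresses are dropped; first-seen order is kept.
    """
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = [pool.submit(fn, query) for fn in engines.values()]
        result_lists = [f.result() for f in futures]  # same order as engines
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

MERGED = simultaneous_search("search engines", STUB_ENGINES)
print(MERGED)
```

      Because the queries run side by side, the total wait is roughly that of the slowest engine rather than the sum of all of them. Merging matters as well, since the same site is often returned by more than one engine.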

CONCLUSION

      The information seeker will have a daunting task if there are hundreds of search engines and if all search engines are equal. But they are not all equal.

      It is reassuring to know that there are only eight search engines that cover a significant part of the Internet. There are, however, important segments of the Internet that are currently best covered by specialized search engines. If you are lucky, the primary search engine will point you to these specialized databases. In a sense, though, this only shifts the problem, since you are likely to get 200 or so returns for your query and these will take hours to explore. A general knowledge of what is available from the niche search engines, then, should be part of the information seeker's repertoire.

ENDNOTES

1. Jean Armour Polly (1994) is the one who originated the phrase "surfing the Net":

I wanted something that expressed the fun I had using the Internet, as well as hit on the skill, and yes, endurance necessary to use it well. I also needed something that could evoke a sense of randomness, chaos, and even danger. I wanted something fishy, net-like, nautical.

At the time I was using a mousepad from the Apple Library in Cupertino, CA, famous for inventing and appropriating pithy sayings and printing them on sportswear and mousepads (e.g. "A month in the Lab can save you an hour in the Library"). The one I had pictured a surfer on a big wave. "Information Surfer" it said. "Eureka," I said, and had my metaphor.

2. If A denotes the set of 49 search engines, then the table was the Cartesian product A x A.

3. Membership in the primary search engine category fluctuates over time:

* Fryxell (1996) discusses a directory (Yahoo) and six search engines: Alta Vista, Excite, InfoSeek, Lycos, Open Text, and WebCrawler. The author submitted five search queries to each and describes the quantity and relevance of the returns provided by each search engine.

* Gray (1996) discusses search strategies for using primary search engines, mega-index search engines, and directories. He rates Alta Vista as the premier search engine on the Web.

* Peterson (1997) compares the power (number of returns) and precision (number of relevant returns) for Alta Vista, Excite, HotBot, Lycos, Open Text, Ultra, and WebCrawler.

* Venditto (1996) compares seven major Web search engines available for free: AltaVista, Excite, InfoSeek Guide, Lycos, Open Text, WebCrawler, and WWW Worm. His article provides an excellent discussion of search strategies.

4. Pegoraro (1997) describes the difference between directories and search engines as follows:

Site directories like Yahoo are lists, edited and maintained by human beings, of Web sites devoted to various topics. They include a relatively small number of pages in their databases -- 500,000 in Yahoo's case. (The total number of Web pages probably exceed 100 million, but nobody knows for sure).

Search engines like Alta Vista automatically compile their listings with programs called "spiders" and "crawlers" that trawl the Web and record the text and topics of pages. Search engines index far more pages than directories -- usually 30 to 50 million.

5. This site is also the home page of BUS313 Economic and Financial Environment of Global Business. Web-based materials for the class include:

* course syllabus

* glossary for important terms from the course textbook

* essay questions

* Internet tutorials

* link to my Top Ten Web-Based Newspapers and Magazines. Each of these is free and each has searchable archives. It was written up by John Marcus (1997). My students access this site frequently for their Web Report projects.

* link to Big Eight Search Engines. This is the second major source for the students' Web Reports.

* Student Web Reports from Fall 1996, available at
http://www2.hawaii.edu/~rpeterso/webreprt.htm,
include global companies, Asia-Pacific countries, and global issues.

REFERENCES

      Bernstein, J. (1996, April). Finding A Needle in a Digital Haystack. _Netguide_, pp. 79-80.

      Fryxell, D. (1996, March/April). Nine Web Search Sites Examined. _LINK-UP_, pp. 29-30.

      Gray, T. (1996, May). How to Search the Web: A Guide to Search Tools [WWW document]. URL

http://issfw.palomar.edu/Library/TGSEARCH.HTM

      Marcus, J. (1997, February/March). The Top Ten Web-Based Newspapers and Magazines. _database_, pp. 80-83.

      Meta-Search Engines. (1997, March 4). [WWW document]. URL

http://ds1.internic.net/tools/meta.html

      Neubarth, M. (1996, May). Taming the Chaos. _Netguide_, p. 10.

      Pegoraro, R. (1997, January 31). So Many Search Engines, So Little Time. _The Washington Post_ [WWW document]. URL

http://washingtonpost.com/wp-srv/WPlate/1997-01/31/081L-013197-idx.html

      Peterson, R. (1997, February). Eight Internet Search Engines Compared. _First Monday_ [WWW document]. URL

http://www.firstmonday.dk/issues/issue2_2/peterson/index.html

      Polly, J. (1994, November). The Nascence of Surfing the Internet [WWW document]. URL

http://www.well.com/user/polly/birth.html

      Venditto, G. (1996, May). Search Engine Showdown. _Internet World_, pp. 79-86.