THE BIOLOGY OF THE INTERNET: A CLADISTIC TAXONOMY
by
Richard Einer Peterson

The world is moving so fast these days that the man who says it can't be done is generally interrupted by someone doing it. -- Elbert Hubbard

To search is nothing in painting. To find is everything. -- Picasso

The metaphor of the Internet is a World Wide Web (WWW) inhabited by spiders. The evolution of the Web is a living, ongoing, real-time example of Darwinian evolution. The spiders stored away in alcohol-filled half-liter bottles in museums belong to lineages older than man; the spiders of the Internet were born in our lifetime. The Net is evolving so rapidly that the time between its generations is almost as short as that between generations of mosquitoes. What a beautiful specimen the Internet is!

We are all familiar with the famous taxonomy from biology (1996 World Almanac, p. 192):

Kingdom, Phylum, Class, Order, Family, Genus, Species
("Kinetic Purple Cats Over Fence Gates Soar")

"Kingdom" originally consisted of plants and animals. Bacteria complicated the picture and by now there are anywhere from 8 to 30 kingdoms. This classification scheme has itself evolved into cladistics and cladograms -- words which were born near the time of the birth of the Internet. The origin of the science of cladistics dates back to a World War II prisoner of war held by the British in Italy -- Willi Hennig (Sue Hubbell, "How taxonomy helps us make sense out of the natural world"; Smithsonian, May 1996, pp. 140-151). The words cladistics and cladograms date back to 1966.

Cladistics is a biological classification scheme based on phylogenetic (evolutionary) relationships. A cladogram is a branching tree which diagrammatically displays similarities, differences and time patterns (evolvement). It is rather similar to the time-series cross-sectional analysis of economics. Phylogeny is the evolutionary history of a group whereas ontogeny is the developmental history of the individual. They are themselves related in the saying "ontogeny recapitulates phylogeny" (much criticized, however; see, for example, Stephen Jay Gould, "Freud's Phylogenetic Fantasy: Only great thinkers are allowed to fail greatly"; Natural History, December 1987, pp. 10, 14, 16, 18, 19).

Jonathan Coddington, chair of the entomology department and curator of spiders at the Smithsonian's National Museum of Natural History, believes that "Good taxonomy...has predictive value" (Sue Hubbell, op. cit., p. 143).

He has what he calls "The matrix" -- an arachnoid cladogram which displays 354 characteristics along one axis and 139 genera of spiders along the other axis: a grand total of roughly 49,000 cells (354 x 139 = 49,206) which can be analyzed digitally to form family trees ("cladograms").
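To make the idea of a character matrix concrete, here is a minimal sketch in Python. The taxa, the character states, and the function names are all invented for illustration; they are not Coddington's data. Note also that real cladistic analysis weighs shared derived characters under a parsimony criterion, whereas this sketch only counts the cells in which two rows differ, which is a much cruder measure of relatedness.

```python
from itertools import combinations

# Hypothetical character matrix: one row per genus, one 0/1 entry per
# character. All names and values are invented for illustration.
TAXA = {
    "genus_A": [1, 0, 0, 1],
    "genus_B": [1, 1, 0, 1],
    "genus_C": [0, 0, 1, 1],
    "genus_D": [0, 1, 1, 0],
}


def cell_count(n_characters, n_taxa):
    """Number of cells in a characters-by-taxa matrix."""
    return n_characters * n_taxa


def differences(row_a, row_b):
    """Count the characters on which two taxa disagree."""
    return sum(a != b for a, b in zip(row_a, row_b))


def closest_pair(matrix):
    """Return the pair of taxa whose rows differ in the fewest cells."""
    return min(
        combinations(sorted(matrix), 2),
        key=lambda pair: differences(matrix[pair[0]], matrix[pair[1]]),
    )


print(cell_count(354, 139))   # size of the matrix described in the text
print(closest_pair(TAXA))     # the two most similar hypothetical genera
```

Repeatedly joining the closest rows would give a rough branching tree; a proper cladogram would require a parsimony or similar analysis rather than this simple distance count.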

Which brings us to the family tree of an important segment (kingdom?) of the Web -- the Internet Search Engines. The spiders of the World Wide Web are computer programs that travel the Internet, visiting sites, retrieving documents, and retrieving all documents mentioned in those documents. The process continues until eventually all documents which are cited by other documents have been discovered. These are gathered into an indexed database. A search engine is a program that searches these databases for the keywords or phrases that they contain (Richard Peterson, "Harvesting Information from the Internet Using Search Engines," April 1996. On-line. Available at http://www2.hawaii.edu/~rpeterso/harvest9.htm).
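The spider-and-index process just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" is a small in-memory dictionary with made-up page names and text, no network access takes place, and the function names (crawl, build_index, search) are hypothetical rather than taken from any actual search engine.

```python
from collections import deque

# Hypothetical in-memory "web": page name -> (text, pages it cites).
# Pages and contents are invented; nothing is fetched from the network.
PAGES = {
    "a.htm": ("spiders crawl the web", ["b.htm", "c.htm"]),
    "b.htm": ("search engines index documents", ["c.htm"]),
    "c.htm": ("spiders retrieve documents", ["a.htm"]),
    "d.htm": ("this page is cited by nobody", []),
}


def crawl(start, pages):
    """Visit every document reachable from `start` by following citations."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        if url not in pages:
            continue  # a cited document that does not exist
        visited.append(url)
        for link in pages[url][1]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


def build_index(urls, pages):
    """Build a keyword index: word -> set of documents containing it."""
    index = {}
    for url in urls:
        for word in pages[url][0].lower().split():
            index.setdefault(word, set()).add(url)
    return index


def search(index, query):
    """Return the documents that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


visited = crawl("a.htm", PAGES)
index = build_index(visited, PAGES)
print(visited)
print(sorted(search(index, "spiders documents")))
```

In this sketch the page d.htm is never found, because no other document cites it: a spider discovers only what is linked from somewhere it has already been.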

In my earlier article ("Internet Search Engines," March 1996. On-line. Available at http://www2.hawaii.edu/~rpeterso/engine_.htm), I divided search engines into four basic categories:

(1) Primary Search Engines
Their robots have accessed a significant portion of the Web and the database is keyword searchable. There are just six: Alta Vista, Excite, Infoseek, Lycos, NlightN, and Open Text.
(2) Niche Search Engines
The spiders for these collect information to form databases of a small-fraction, or a specialized segment, of the Internet. There are hundreds of these. Examples of small-fraction search engines are Harvest Broker, Magellan, WebCrawler, and WWW Worm. Examples of specialized segment databases are Yahoo!, CNN Search, Point Top 5%, and Shareware.com.
(3) Mega-Index Search Engines
These do not have their own information gathering spiders. They do, however, provide access to primary search engines, niche search engines, and even other mega-index search engines. There are hundreds; some examples are @once!, Bobaworld, and NetPad.
(4) Simultaneous Mega-Index Search Engines
These access most of the primary search engines in parallel (all at once). There are just three: MetaCrawler, Savvy Search, and Search.com.

This four-fold classification scheme does serve the purpose of simplifying the seeming chaos of how to find information on the Internet: there are only six primary search engines. And yet, in a sense, it fails to acknowledge the emerging trends which render the scheme almost immediately out-of-date. There are now search engines which do continuous searches according to your particular interest -- "personalized search engines." There are search engines which gather and index corporate databases -- "Intranets" instead of "Internet." And there are commercial search engines which compete with the free search engines of the Internet; examples are Lexis/Nexis, ABI/Inform, Uncover, and Dialog.

Within the four-fold classification scheme itself, however, there are time-dimensioned cladograms available for examining the cross-section of search engines, however defined. As a simple example, consider just the six primary search engines as comprising the rows of the matrix. There could be over a hundred columns delineating their various characteristics such as

The Internet is so dynamic that the matrix constantly needs new columns, which themselves imply future trends and directions. The set of primary search engines changes as well -- so additional rows need to be inserted into the matrix. Alta Vista, the premier search engine, came on-line just a few months ago in December 1995. Rows are never deleted because they are part of the history and evolution of the matrix.
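A matrix that only ever grows can be sketched as a small data structure. The class below is a hypothetical illustration in Python, not an existing tool: the class name, method names, and the column names "came_online" and "personalized_search" are assumptions made for the example. The only values filled in are those stated in the text (the six primary search engines and Alta Vista's December 1995 start); every other cell is left empty.

```python
class FeatureMatrix:
    """Rows are search engines, columns are characteristics.

    Rows and columns can be added but, by design, never deleted,
    so the matrix keeps its own history.
    """

    def __init__(self, title):
        self.title = title
        self.columns = []   # characteristic names, in order of addition
        self.rows = {}      # engine name -> {column: value}

    def add_column(self, name):
        """Add a new characteristic; existing rows leave it unrecorded."""
        if name not in self.columns:
            self.columns.append(name)

    def add_row(self, engine, values=None):
        """Add an engine, or record more characteristics for an existing one."""
        values = values or {}
        for column in values:
            self.add_column(column)
        self.rows.setdefault(engine, {}).update(values)

    def cell(self, engine, column):
        """Return the recorded value, or None if not yet recorded."""
        return self.rows[engine].get(column)

    def shape(self):
        return (len(self.rows), len(self.columns))


matrix = FeatureMatrix("Primary Search Engines")
for name in ["Alta Vista", "Excite", "Infoseek", "Lycos", "NlightN", "Open Text"]:
    matrix.add_row(name)

# The one characteristic value given in the text.
matrix.add_row("Alta Vista", {"came_online": "December 1995"})

# A new trend becomes a new (hypothetical) column; no row is removed.
matrix.add_column("personalized_search")
print(matrix.shape())
```

Leaving unrecorded cells as None, rather than guessing a default, keeps the distinction between "this engine lacks the feature" and "nobody has checked yet."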

The title of the table itself is up for grabs. In what I have just presented, it was "Primary Search Engines." But that is just a subset of Search Engines -- both primary and niche. There is, then, a family of search engines.

The Internet, however, consists of many more things than just search engines. The Internet has many kingdoms. It has many spiders, ants and worms. It has many human beings in many countries. It has directories which are like the tables of contents of millions of documents. It has search engines which are like indexes to all the documents. It is a wild and woolly world which needs a map. It needs, of course, a family of maps. It needs the clarity of a classification scheme. It does not need to wrestle for centuries for a model to use, as biology did. Biology's current standard is the science of cladistics. Hence, the "biology of the Internet."