Distant Data: Distribution and Shared Access of Commercial Databases Over a National Network

Ted Koppel

Introduction

The Colorado Alliance of Research Libraries began in the 1970's as a non-profit organization with the goal of promoting cooperation among its members, seven academic and university research libraries in Northeastern Colorado. The original members were the Denver Public Library, the University of Denver, the Colorado School of Mines, the Auraria Libraries, the University of Northern Colorado, the University of Colorado at Boulder, and Colorado State University in Fort Collins. (Colorado State University dropped out of the Alliance in the early 1980's but rejoined in 1991.)

In the late 1970's, it became clear that one important factor in interlibrary cooperation would be a shared online catalog. Development began in conjunction with the DataPhase Corporation of Kansas City, but DataPhase never delivered a suitable system to CARL, and in 1981 Ward Shaw and Trish Culkin began the design and implementation of a locally developed public access catalog. Bibliographic records were loaded during 1982 and 1983, and the first terminals on the new system came up in late 1983. By 1985 there were more than 150 terminals connected. In contrast, as of May 1993 there are about 1425 terminals connected to the CARL central computer in Denver, as well as a number of process "servers" or creators that begin PAC jobs on an ad-hoc basis as requests arrive over various telecommunications networks.

Since 1984, CARL has developed and installed several other software modules for library management, including a Circulation system, an Acquisitions system, a Bibliographic Maintenance system, a Serials system, an Electronic Mail system, as well as hundreds of support and batch programs.

During the middle 1980's, CARL was approached by other libraries within Colorado about sharing its software. Various marketing structures were tried in an effort to sell the CARL Integrated Library management package both in and out of the state of Colorado. In 1988, the Colorado Alliance of Research Libraries spun off a for-profit subsidiary, called CARL Systems Inc. The goal of the new organization was to promote the sales and support of the CARL Library System, as well as to develop other databases and projects of value to the library clientele.

CARL originally chose Tandem Computers' fault-tolerant line of equipment as its computing platform. Although fault tolerance and minimal downtime were significant factors in this choice of hardware, for the purposes of this article, the ability to easily network multiple Tandem sites is of particular importance.

In Fall 1988, CARL Systems, Inc., began the creation of UnCover, an index and delivery service for journals. UnCover indexes the tables of contents of over 14,000 journals - almost 4,000,000 articles - and is as current as the journals that arrived in the mail yesterday. UnCover2 (the delivery service) began in Fall 1991 as a full-text fax-delivery service. In March 1993 CARL Systems and Blackwell's (Oxford) announced a cooperative agreement which, among other things, created the UnCover Company. One near-term goal of the cooperative venture is increasing the number of titles covered to 20,000.

EARLY CARL BIBLIOGRAPHIC AND NON-BIBLIOGRAPHIC DATABASES

By 1986, CARL software and Tandem hardware began to be installed at various "clone" sites around the state of Colorado. These sites purchased their own Tandem hardware and ran CARL software on it. Early sites included several public libraries (Pikes Peak Library District, the Boulder Public Library) and the Marmot Project (a western Colorado consortium of thirteen libraries and library systems covering 65% of the land area of Colorado, and comprising about 200 actual libraries).

CARL secured contracts to install systems in Montgomery County, Maryland, Arizona State University (Tempe), Northeastern University (Boston), the University of Hawaii Libraries, the University of Maryland System, and several other sites around the country.

As each new site came up on CARL software and became familiar with CARL software interaction, library management began to realize that the CARL information software platform had utility with databases that were not strictly bibliographic. (Koppel, 1987) Pikes Peak Library District developed and expanded its original "Maggie's Place" menu, with several locally produced databases, such as "Child Care Providers", "Voting Locations", "Parks and Recreation" and several others. Capitol Regional Library Council (Hartford, CT) produced an index to the Hartford Courant (newspaper). CARL (Denver), in cooperation with the Greater Denver Chamber of Commerce, assembled an online almanac called "Metro Denver Facts", and, in cooperation with the Colorado Department of Education, developed a database called "Talent Bank".

However, most of the early efforts to expand menu selections into non-bibliographic data produced local databases, appealing primarily to an audience geographically centered around the site that built them.

COMMERCIAL DATABASES

The late 1980s saw the convergence of several evolutionary trends. First, the cost of computer mass storage on magnetic and/or optical disk dropped significantly. Technological advances allowed disk drive manufacturers to double or even quadruple disk densities at the same time that prices were coming down. Also, technological advances in fixed (Winchester) disk storage made the physically smaller and more dependable technology more accessible.

Second, the producers of data and information products began to realize that there was a market for locally mounted databases. Until then, most commercial databases were accessed through Dialog, BRS, or Orbit. Many database producers began to think in terms of site licensing and flat fee contracts, as opposed to the odious per-minute charges prevalent with the commercial online services. Several courageous database producers sold their databases at very attractive prices; however, the proliferation of consortia and shared catalogs has greatly confused database pricing policies in recent years. Regrettably, we are seeing a tendency on the part of database producers to overprice their products, particularly to consortia, because they don't fully understand the market, or database use within the market.

Third, and most important to this discussion, was the realization on the part of libraries that it would be beneficial to their clientele to allow searching of indexes using the same terminals and software that their patrons already knew and used in their Public Access Catalog. Online database access can by design handle multiple simultaneous users. CD-ROM technology, although useful, can only handle a single search at a time, and usually requires the user to step to a different workstation and use a different command structure.

CARL CENTRAL DATABASES

The Denver CARL computer was the first to load commercially produced databases onto its PAC. The initial test of concept was the loading of A Matter of Fact (Pierian Press, Ann Arbor, Michigan). The original database was supplied on magnetic tape and was updated semi-annually. Although the database itself was small and enjoyed limited patron use, it showed that non-bibliographic data could be mapped to MARC-like records, indexed, and made retrievable to the public.

CARL's second experiment was with ABC-CLIO's Kaleidoscope: Current World Data. Kaleidoscope was originally published on index cards and was an almanac-like compendium of facts and statistics about all the countries of the world. Kaleidoscope was originally made available at no charge to CARL, again as an attempt to discern the transferability of other types of data to CARL hardware and software.

Choice Book Reviews followed. Although a small database (about 30,000 reviews since 1988), it enjoys a great deal of use because of the lengthy critical reviews included each month. The ISBN numbers included with each Choice review are matched against the ISBN files of CARL member institutions, and the bottom of each Choice review displays the names of those institutions that own that title.
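The ISBN matching described above can be illustrated with a short sketch. This is not CARL's code; it assumes that each institution's ISBN file can be treated as a set of normalized ISBN strings, and the institution names, the helper names (normalize_isbn, owning_institutions) and the sample ISBNs are all hypothetical.

```python
def normalize_isbn(raw: str) -> str:
    """Strip hyphens and spaces and upper-case the check character."""
    return raw.replace("-", "").replace(" ", "").upper()


def owning_institutions(review_isbns, holdings):
    """Return the sorted names of institutions that hold any ISBN on the review."""
    wanted = {normalize_isbn(isbn) for isbn in review_isbns}
    return sorted(name for name, isbns in holdings.items() if wanted & isbns)


# Hypothetical holdings: institution name -> set of normalized ISBNs.
holdings = {
    "Institution A": {normalize_isbn("0-306-40615-2")},
    "Institution B": {normalize_isbn("0-19-853453-1")},
    "Institution C": {normalize_isbn("0 306 40615 2"),
                      normalize_isbn("0-19-853453-1")},
}

# The list that would be shown at the bottom of one review.
owners = owning_institutions(["0-306-40615-2"], holdings)
print("Owned by: " + ", ".join(owners))
```

Normalizing before comparison matters because the same ISBN may be punctuated differently in the review file and in a member's holdings file.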

The Information Access Company then approached CARL about making available their Business Index and Magazine Index. In conjunction with online access to the indexes, IAC made available to CARL the microfilm collection of supporting full text documents. As part of CARL's PAC implementation, CARL would, for a fee, print out a copy of the microfilm and fax the image to the requester. In addition, IAC sent CARL magnetic tapes with the article full texts so that they could be retrieved and displayed (in ASCII) online. In 1991, CARL added IAC's Enhanced Academic Index, and in early 1992, Company ProFiles and National Newspaper Index. In 1992, IAC Legal Resources Index was mounted at Gonzaga University in Spokane, Washington.

In 1991 University Microfilms International (UMI) contracted with a CARL site to load their UMI Newspaper Index. CARL is currently planning to load two additional UMI indexes in calendar year 1993: ABI-Inform and General Periodicals Index are to be indexed and available in late summer.

In 1988-89, Arizona State University licensed six databases from the H.W. Wilson Company. Humanities, Social Sciences, Applied Science and Technology, Business Periodicals Index, Education Index, and General Science Index were loaded and made available on the ASU machine. In 1991, Northern Arizona University licensed the Biological and Agricultural Index (another Wilson index) for its use. In 1992, the INLAN network (computer housed at Gonzaga University, Spokane, WA) mounted Library Literature. Wilson has changed its tape output format since CARL's initial loading of Wilson files; future file loads will adapt to Wilson's implementation of MARC from their old export format.

In late 1991 CARL introduced its load of the 800,000 record ERIC database. ERIC on CARL differs from most of the other commercial databases on CARL in that the abstracts were not to be Word indexed. ERIC's extensive thesaurus-term assignment obviated the need for indexing words in the abstract, and the CARL Access Issues Committee specified the indexing rules.

In late 1988, CARL Systems, Inc., began the construction of its UnCover database which gives access to more than 4 million articles in about 14,000 journals.

TOPOLOGY

As mentioned previously, Tandem system architecture allows geographically dispersed computers to interact with each other as if they were local. Of the database groups enumerated above, several are housed in Spokane (Library Literature, IAC National Newspaper Index, and Legal Resources Index), one is in Boston (UMI Newspaper Index at Northeastern University), six are in Tempe, Arizona (the Wilson indexes) and all of the rest are mounted on CARL's Denver Tandem machine. Nevertheless, any licensed user on any of the Tandems attached to the CARL network has immediate and easy access to any of the files on any of the CARL 'clone' sites, subject to licensing permission.

Tandem makes available a utility that is called the EXPAND networking system. EXPAND allows a telecommunications link with a standard file name and addressing system that makes remote files accessible as if they were local. Currently, all EXPAND connections are routed through the Denver hub and then rerouted to the final home destination of the file. Processes (called $CNET processes) control the activities of the various requesters to the various hosts; Tandem's interprocess communication facility through their operating system allows incoming PACs to initiate new PAC sessions on remote machines. Since all of the CARL clone sites use essentially the same PAC software, the user is equipped to use the database at the remote site immediately, being already familiar with the software at his/her home site.
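The hub arrangement described above can be pictured with a small routing sketch. It is only an illustration: the site labels and database groupings follow the text, while the function names (home_of, route) and the rule that any unlisted database lives in Denver are assumptions made for the example, not part of EXPAND itself.

```python
from typing import List

HUB = "Denver"

# Host site for each database group named in the text; anything not listed
# here is assumed to be mounted on the Denver machine.
DATABASE_HOME = {
    "Library Literature": "Spokane",
    "IAC National Newspaper Index": "Spokane",
    "Legal Resources Index": "Spokane",
    "UMI Newspaper Index": "Boston",
    "Wilson indexes": "Tempe",
}


def home_of(database: str) -> str:
    """Return the site that houses the given database."""
    return DATABASE_HOME.get(database, HUB)


def route(user_site: str, database: str) -> List[str]:
    """Return the sites a request passes through, with all traffic via the hub."""
    target = home_of(database)
    if user_site == target:
        return [user_site]            # database is local to the user
    if HUB in (user_site, target):
        return [user_site, target]    # one hop to or from the hub
    return [user_site, HUB, target]   # spoke to spoke, through the hub


print(route("Tempe", "UMI Newspaper Index"))
```

A user in Tempe searching the UMI Newspaper Index therefore crosses two links, Tempe to Denver and Denver to Boston, which is one reason traffic on each link is worth regulating.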

$CNET software has advantages and disadvantages. A major advantage is that the site administrator at the local (accessing) site needs to know very little about the databases on the remote site; the $CNET process takes care of regulating file opens and sessions. The number of simultaneous sessions can be controlled and limited in order to control traffic on the telecommunications line. (In early $CNET implementations, the number of sessions was limited to 10 in each direction because most telecom links were 9600 baud dedicated lines; since most EXPAND connections in 1993 are 56 kbps or faster, the number of simultaneous $CNET sessions has been increased to several dozen in each direction.)
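The per-direction session cap can be sketched as a simple counter. This is a minimal illustration of the idea rather than a description of how $CNET is written; the class name DirectionLimiter and its methods are hypothetical, and the limit of 10 is the early figure quoted above.

```python
class DirectionLimiter:
    """Caps the number of simultaneous sessions in one direction of a link."""

    def __init__(self, max_sessions: int) -> None:
        self.max_sessions = max_sessions
        self.active = 0

    def try_start(self) -> bool:
        """Admit a new session only if the cap has not been reached."""
        if self.active >= self.max_sessions:
            return False
        self.active += 1
        return True

    def finish(self) -> None:
        """Release the slot held by a completed session."""
        if self.active > 0:
            self.active -= 1


# One limiter per direction, using the early 9600 baud limit of 10.
outbound = DirectionLimiter(10)
inbound = DirectionLimiter(10)

# Eleven users ask for an outbound session at once; the eleventh is refused.
admitted = [outbound.try_start() for _ in range(11)]
print(admitted.count(True), "admitted,", admitted.count(False), "refused")
```

Keeping a separate counter for each direction means that heavy outbound use from one site does not prevent remote users from reaching that site's own databases.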

Another way of handling file access at a remote site is through direct file opens. Because there are no $CNET processes, there is no intermediate process to delay the search. Further, since it is one's home PAC process that is opening the remote files, various security and passwording requirements are far easier to implement. However, since the local process is opening the remote file directly, the local system operator must be aware of file movement and relocation at the remote site or access will be impeded.

ADVANTAGES AND DISADVANTAGES

Since there are two ways to access files on remote systems, it is useful to weigh the advantages and disadvantages of each method.

The $CNET process, as noted above, acts as a 'filter' between the local Tandem machine and the remote Tandem machine. It accepts the request for a connection to a remote system, passes the job name, terminal type characteristics, database choice (optionally) and search strategy (optionally) to the remote $CNET process, which proceeds to start a PAC session on behalf of that terminal. The remote PAC session handles all of the interaction with the terminal until the user decides to return 'home'. Since this 'filter' is responsible for starting PACs and detecting their completion, it can be used to regulate traffic across an EXPAND line. Also, since $CNET plays a key role in the PAC starting process, it can gather various statistics on PAC starts and stops to and from a given site. Setting up a $CNET process is quite easy for both the local and remote sites; the system operator need do little else than run a piece of software as a named process, and all of the default PAC interactions take place. (Of course, some sites vary the menu choices available to remote sites; the system accommodates a very high degree of customization.)

Another advantage to $CNET is that full control over access rests at the remote site. If, for instance, their system is to be taken down for a cold load or software upgrade, the operator need only stop the $CNET processes to disallow any remote PAC starts. [This can, of course, be seen as a disadvantage to the Remote user!]
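The handoff and the operator control described in the two paragraphs above might be modeled as follows. This is a hedged sketch, not the actual $CNET interface: the PacRequest fields mirror the items the text says are passed (job name, terminal type, optional database choice, optional search strategy), while home_site, the "MAIN MENU" default, the accepting flag and all class and method names are assumptions introduced for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PacRequest:
    job_name: str
    terminal_type: str
    home_site: str                   # hypothetical: kept only for statistics
    database: Optional[str] = None   # optional database choice
    search: Optional[str] = None     # optional search strategy


class RemoteCnet:
    """Stand-in for the remote process that starts PAC sessions."""

    def __init__(self) -> None:
        self.accepting = True        # the operator clears this before a cold load
        self.starts: Counter = Counter()
        self.stops: Counter = Counter()

    def start_pac(self, request: PacRequest) -> Optional[dict]:
        """Start a session for the request, or refuse if the site is closed."""
        if not self.accepting:
            return None
        self.starts[request.home_site] += 1
        return {
            "job": request.job_name,
            "terminal": request.terminal_type,
            "home": request.home_site,
            "screen": request.database or "MAIN MENU",
            "pending_search": request.search,
        }

    def end_pac(self, session: dict) -> None:
        """Record that the user has returned home."""
        self.stops[session["home"]] += 1


cnet = RemoteCnet()
session = cnet.start_pac(
    PacRequest("PAC01", "VT100", "Denver", database="ERIC", search="literacy"))
cnet.end_pac(session)
print(cnet.starts["Denver"], cnet.stops["Denver"])
```

Because every start and stop passes through one object, counting sessions per originating site comes almost for free, which is the statistical advantage the text attributes to $CNET.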

An additional advantage to $CNET is that movement of files can be done expeditiously without the need for notifying any remote sites of those moves. Since files are opened on the remote system, only remote file opening processes need to know about file moves, while in the 'open direct' method, the movement of files MUST be reported to remote access sites so that their tables of file pointers can be updated.
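The difference between the two methods when a file is moved can be shown with a small model. It is a sketch under simplifying assumptions: file locations are plain strings rather than real Tandem file names, and RemoteSite, open_via_cnet and open_direct are hypothetical names standing in for the two access paths.

```python
class RemoteSite:
    """Holds the authoritative table of where each database file lives."""

    def __init__(self, locations: dict) -> None:
        self.locations = dict(locations)

    def move(self, database: str, new_location: str) -> None:
        """Relocate a database file at the remote site."""
        self.locations[database] = new_location

    def has_file_at(self, location: str) -> bool:
        return location in self.locations.values()

    def open_via_cnet(self, database: str) -> str:
        # The remote side resolves the name from its own, always current, table.
        return self.locations[database]


def open_direct(pointer_table: dict, remote: RemoteSite, database: str) -> str:
    """Open a remote file using the accessing site's own copy of the pointers."""
    location = pointer_table[database]
    if not remote.has_file_at(location):
        raise FileNotFoundError(f"{database}: nothing at {location}")
    return location


remote = RemoteSite({"ERIC": "volume1/eric"})
local_pointers = {"ERIC": "volume1/eric"}   # copy kept at the accessing site

# The remote operator moves the file without telling anyone.
remote.move("ERIC", "volume2/eric")
print(remote.open_via_cnet("ERIC"))         # still resolves correctly
```

After the move, open_direct fails until the accessing site's pointer table is updated by hand, which is exactly the notification burden the 'open direct' method imposes.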

There are several disadvantages to the $CNET apparatus. If a system is nearing its capacity to create new processes (currently 255 per processor in Tandem's C30 operating system), then the additional number of new processes being created by $CNET (as many as 36-40 per site) can be enough to cause a given processor to return error messages and reject the starting of a new remote PAC process.
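The arithmetic behind this limit is simple enough to sketch. The 255-process figure comes from the text; the assumptions that each remote PAC start consumes exactly one process and that all of them land on a single processor are simplifications made for the example, and the function name admit_remote_pacs is hypothetical.

```python
PROCESS_LIMIT = 255  # per-processor limit cited in the text


def admit_remote_pacs(processes_in_use: int, requested: int):
    """Return (started, rejected), assuming one new process per remote PAC."""
    headroom = max(PROCESS_LIMIT - processes_in_use, 0)
    started = min(requested, headroom)
    return started, requested - started


# A processor already running 230 processes receives 40 remote PAC starts.
started, rejected = admit_remote_pacs(processes_in_use=230, requested=40)
print(started, "started,", rejected, "rejected")
```

With only 25 free process slots, 15 of the 40 requests would be turned away, even though the $CNET session limit itself has not been exceeded.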

A further disadvantage of the $CNET process is the current limit of 36-40 sessions in each direction. While many CARL Systems sites are small and never approach that number of simultaneous sessions, there are several multi-campus, high-traffic sites that occasionally reach this barrier.

An additional disadvantage to $CNET is that user validation to licensed databases can be somewhat more difficult when the PAC is started through a $CNET. This is because the remotely-opened PAC needs to open up the proper USERS file on the user's home system. For a variety of reasons, identification of the correct users file is significantly less efficient when done through $CNET.

CONCLUSION

The previous pages describe the implementation of shared databases across geographically-distributed computer sites, as implemented by CARL Systems, Inc., and its customer sites around the United States. Although some of the specific procedures and nomenclature will not be readily transferable to other hardware platforms, the conceptual model of distributing databases across a network such as CARL Systems' must be considered as libraries attempt to satisfy the information needs of their clients.

Such network structuring offers efficient use of disk space across a network, better utilization of telecommunications links, less duplication of effort among various host computers, and, because of the wide and eclectic nature of the database choices across the CARL Systems network, a broader selection than any single site could attempt to offer by itself.


Ted Koppel is Senior Analyst at CARL Systems Inc, where he has primary technical responsibility for the UnCover article database. He has been building databases for CARL Systems since 1985. Recent work includes the Journal Graphics television transcripts database, the CARL implementation of ERIC, and Choice Book Reviews. Email: Internet: ted@solaris.carl.org


REFERENCES

Koppel, Ted and Ward Shaw "What's that doing here? Non-bibliographic data in a bibliographic environment", RSR: Reference Services Review, Winter 1987, p. 15-19.