Internet Insights - Thoughts about Federated Searching

An abridged, text-only version was published in Information Today, 21(9), October 2004, p. 17.


The words federal and federated do not always conjure up positive images. Still, federated is the most expressive adjective when it comes to the consolidated retrieval of results in response to a query sent to several databases hosted by different online information systems. Federated searching is more than multiple-database searching, metasearching, polysearching, or broadcast searching, all of which put the emphasis on searching; there are other steps in federated searching, and they make the process as difficult as herding cats. Federated searching consists of transforming a query and broadcasting it with the appropriate syntax to a group of disparate databases, merging the results collected from the databases, presenting them in a succinct and unified format with minimal duplication, and allowing the library patron to sort the merged result set by various criteria.
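
To make that division of labor concrete, the whole chain can be sketched in a few lines of code. This is purely an illustration of the steps just listed - the connector objects and their translate/search/normalize methods are hypothetical stand-ins, not the API of any actual product:

```python
# Hypothetical sketch of the federated search pipeline described above;
# none of these names come from MetaLib, MetaFind, WebFeat, or any vendor.

def federated_search(query, connectors, sort_key="title"):
    """Broadcast a query to several databases, then merge, dedupe, and sort."""
    merged = []
    for c in connectors:                      # one connector object per target database
        native = c.translate(query)           # rewrite the query in the target's own syntax
        for hit in c.search(native):          # run it on the remote system, collect raw hits
            merged.append(c.normalize(hit))   # map each hit to a common record format
    merged = dedupe(merged)                   # collapse near-identical records
    return sorted(merged, key=lambda rec: rec.get(sort_key, ""))  # let the patron re-sort

def dedupe(records):
    """Keep one record per (title, year) pair - a deliberately naive criterion."""
    seen, unique = set(), []
    for rec in records:
        key = (rec.get("title", "").lower(), rec.get("year", ""))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```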

 

The need for federation

Large libraries spend more than one million dollars a year on the digital archives of journal publishers and on abstracting/indexing and full-text aggregator databases. Still, few patrons discover the richness of these scholarly and professional digital sources, and even fewer use them happily and regularly, because they're not exposed well on many libraries' home pages. The different interfaces and search languages are further deterrents, and often the database names don't provide enough clues for choosing them when looking for information about a topic.

• ABI/INFORM

• PAIS

• INSPEC

• ISTA

• CINAHL

Many of the host names (Athena, Dynix, Sirsi, EOS) and database acronyms (ABI/INFORM, PAIS, INSPEC, ISTA, CINAHL, Scirus, Scopus) are as Greek to library patrons as the names of the dishes on the menu in a, well, Greek restaurant. Do you know what all the databases whose acronyms I just mentioned have in common? They all contain materials relevant to library and information science and technology.

 

Learning about the availability of these databases is one thing. Getting to them by clicking through the labyrinth of many library Web sites is another. Getting patrons to use them - while applying the strict semantic and syntactic rules of Boolean and proximity operators to terms looked up in the thesauri - is yet another thing. No surprise that patrons are happy if they make it through one database with some catch on their hook. They don't go on to see whether another database might have more or better results. Most give up, storm out of the library, and throw at Google the query "library anxiety information overload help," which will find a few good-enough full-text reports, case studies, and articles among the first few hits of the more than 10,000 open access Web pages that come up for that search. They may never come back to the (digital) library again. And there goes your ROI.

 

 

Greek restaurateurs in Paris’s Quartier Latin display their foods on the street to lure in the tourists. Food marts in Asia offer tasters and display pictures of their culinary masterpieces to encourage wary American and European visitors to try the exotic cuisine, in the hope that they will order a course or two and return again. Similarly, digital libraries have to offer samples of their varieties of intellectual foods. This would encourage patrons to swiftly browse, pick, taste, and consume highly nutritive information—and come back again.

 

Query translation and broadcasting

This summer and fall, I tested the three most popular federated search engines: Ex Libris' MetaLib, MuseGlobal's MuseSearch (through Innovative Interfaces' MetaFind), and WebFeat's Prism. Query submission and broadcasting seem to be quite similar in all three products, but behind the scenes there are different translation procedures to accommodate the query syntax of the target systems. These daunting tasks are invisible to the end user, and one can only guess at the differences by comparing the results of the federated search engines with those of the native search engines - a topic for my upcoming in-depth article.
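
To give a flavor of what the hidden translation step involves, here is a toy example that rewrites one and the same title-author query into three different target syntaxes. The field tags and operators are generic stand-ins of my own, not the actual query grammars of WoK, the open access archives, or any of the products tested:

```python
# Toy illustration of query translation: one title-author query rewritten into
# the differing syntaxes of three hypothetical target systems. The field tags
# and operators are invented for this example, not any vendor's real grammar.
from urllib.parse import urlencode

def to_target_syntax(title, author, target):
    if target == "fielded-tags":        # systems that expect TI=/AU= style tags
        return f'TI=("{title}") AND AU=({author})'
    if target == "url-parameters":      # systems queried through a CGI-style URL
        return urlencode({"title": title, "author": author, "op": "AND"})
    if target == "prefix-labels":       # systems that use title:/author: prefixes
        return f'title:"{title}" author:"{author}"'
    raise ValueError(f"no translation rule for {target!r}")

for t in ("fielded-tags", "url-parameters", "prefix-labels"):
    print(t, "->", to_target_syntax("journal impact factor", "Smith J", t))
```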

In my tests I found one major, befuddling problem in query translation and broadcasting. It is in WebFeat, and it matters a great deal because WebFeat provides the cross-searching of many open access archives that is incorporated in ISI Web of Knowledge (WoK), so it may give users a bad impression of the otherwise impressive high-end WoK service.

 

WebFeat can take the query entered on the native WoK query template and, after WoK presents its results, run that query on the selected open access database(s) when the user presses the External Collections Result button.

 

Great idea, but it does not work reliably for many of the most typical title-author query combinations in WebFeat. Time and again it returned no results even though one or more records in the targeted databases matched the syntactically valid query when I ran it independently in their native software.

 

I tried many variants, with and without truncation, with the first name spelled out and with the initial only, and there were no results from WebFeat. When I searched by title or name alone, the records matching the original title-author test queries did show up in WebFeat's result list ….

 

…. such as this record, hit #7 of the 8 records returned for the query "journal impact factor" in the title field.

 

There were other idiosyncrasies when WebFeat ran WoK queries against the open access databases; it does not seem to handle intra-cell Boolean operators correctly either.

The other federated search engines did not have such a major problem, one which affects several very large and widely used databases.

 

Database menus and grouping

In MetaLib the systems librarian can set up groups of databases (health, business, computer science, etc.), and users can add their own preferences or remove databases from the predefined groups, and even save them in their private MySpace area - a great convenience for recurring users.

 

MetaFind also has flexible grouping options, and a database may appear under several categories, as EBSCO Academic Search Elite does on this library site. The user can specify how many hits should be returned from the databases (from 5 to 100) and how many should be displayed on a page.

 

WebFeat also allows a variety of query layouts, but it does not offer an option to control the number of hits per source. It displays whatever each database returns by default, which can vary widely, from 10 to 100. The uncomfortably long result lists may make users feel as if they are wading through a marsh, sinking deeper and deeper, especially if an erroneously selected database like Thomas returns 100 "hits," all of them false, for the query (more about this later).

 

Scoreboards

All three products provide a scoreboard of the results from the databases searched. MetaLib shows the progress of the search database by database, but when it starts displaying the actual hits, the scoreboard disappears. You may ask for it to be redisplayed, but it then lies over the result list even though there is enough space to juxtapose the two, and it cannot be repositioned the way a movable pane could be.

 

MetaFind starts the display with a good scoreboard showing the progress of the search, and presents the results beneath the scoreboard, echoing back the query.

 

WebFeat's scoreboard appears almost immediately and is updated as the search progresses across the databases. The progress is usually lightning fast, but the speed has a price: WebFeat does not consolidate/federate the results returned, it just presents them on an "as is" basis, which sometimes yields unwieldy, ill-structured, and disorganized result lists.
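
Under the hood, such a scoreboard is essentially a matter of broadcasting the query to all the targets concurrently and updating each line of the board as that database answers. Here is a bare-bones sketch of the idea, purely my own illustration: the stand-in search_one function, with its simulated delays and hit counts, is invented, and a real connector would of course talk to the remote system instead.

```python
# Bare-bones sketch of a search "scoreboard": the query is broadcast to every
# target concurrently, and each line of the board is printed as soon as that
# database answers. search_one is a stand-in for a real database connector.
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import time

def search_one(db_name, query):
    time.sleep(random.uniform(0.1, 1.0))      # simulate network latency
    return random.randint(0, 100)             # pretend number of hits

def scoreboard(query, databases):
    counts = {}
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(search_one, db, query): db for db in databases}
        for done in as_completed(futures):    # update the board as answers arrive
            db = futures[done]
            counts[db] = done.result()
            print(f"{db:<20}{counts[db]:>5} hits")
    return counts

scoreboard("journal impact factor", ["ABI/INFORM", "PAIS", "INSPEC", "CINAHL"])
```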

 

 

Result lists and record formats

In the rest of the processes (especially in presenting and managing the search results returned by the target databases’ native search engines) there are significant and highly visible differences.

MetaLib’s grid format provides the most succinct and most uniform result list.

 

It also offers the result list in a well-laid-out brief format

 

or in a full format - all three built from the records already fetched from the target databases. The full format links the user to the full-text source either directly or through Ex Libris' excellent SFX option.

The SFX option is available in all three formats and is a perfect example of the synergy of federated searching and linking.
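
For those wondering what makes such linking possible: SFX is an OpenURL link resolver, so the federated search engine only has to pack the citation data it has already fetched into an OpenURL aimed at the library's resolver. The following is a simplified illustration; the resolver address and the sample record are invented, and real SFX installations use their own base URLs and source identifiers.

```python
# Simplified illustration of building an OpenURL (0.1-style key/value pairs)
# from a record the federated search engine has already fetched. The resolver
# base URL and the sample record below are invented for this example.
from urllib.parse import urlencode

RESOLVER = "https://sfx.example.edu/resolver"   # hypothetical library link resolver

def openurl_for(record):
    params = {
        "sid": "fedsearch:demo",                # identifies the referring service
        "genre": "article",
        "atitle": record["article_title"],      # article title
        "title": record["journal_title"],       # journal title
        "issn": record["issn"],
        "date": record["year"],
        "volume": record["volume"],
        "spage": record["start_page"],
        "aulast": record["author_last"],
    }
    return f"{RESOLVER}?{urlencode(params)}"

print(openurl_for({
    "article_title": "A sample article about federated searching",
    "journal_title": "Journal of Imaginary Studies",
    "issn": "1234-5679", "year": "2004", "volume": "21",
    "start_page": "17", "author_last": "Smith",
}))
```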

 

MetaFind also has a formatted brief result list,

 

but it is not as compact as MetaLib's, nor as consistent across the different resources. It is, however, a nice touch that the user can control how many hits should be listed from each resource. The minimum is 5. I would prefer an even lower number, since 2-3 hits per database give enough information for a first glance-over and still keep a relatively tight, easy-to-scan list when showing results from 10-15 databases.

 

The more detailed format of MetaFind tries to cram too much information into the description field, including data elements which the user does not need at this stage (such as the prompt to check the catalog) or ever (such as the accession number within the database).

 

Then again, it is better than what WebFeat offers, which is a single format only. For some of the databases this single format has good enough structure and content to give the user a clue whether it is worth clicking through and being catapulted to the native system for the complete abstracting/indexing record or the full text. More often, however, there are data elements which are irrelevant at this stage (such as my otherwise beloved DOI) as well as pieces of information which are never relevant for end users (such as intra-database accession numbers). To make the records even more protracted, almost every field starts on a new line, which makes the result list look like a cake served in crumbs.

 

In other records on the result list, the title of the article and/or the author name are often missing

 

… and sometimes the title, the author, and the journal name are all absent. Fetching information about the reading level and the size of the articles in kilobytes does not compensate for the missing title, author, and journal name fields.

 

True, you find these when you are taken to the source to view details 

 

but WebFeat could have extracted the author and periodical title from the native system's result list and shown them while you are still in the result list for further browsing.

 

Sorting

Sorting is offered by both MetaLib and MetaFind. The former can sort the result set by relevance (the default), author, publication year, source database, or article title. It works well, except for author sorting, where last-name-first and first-name-first filing are not distinguished; thus A. J. Knox and Andrew Kurmis file ahead of Frederik Anseel.

 

MetaFind can sort the result list by the same criteria plus by order of retrieval, and it shows the same problem in sorting by author as MetaLib, not distinguishing between last-name-first and first-name-first formats, so KF Kaltenborn comes ahead of Kenneth A Borokhovich. Sorting by the other criteria seemed to be correct.
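
The fix is conceptually simple: build a last-name-first sort key for every author string before filing. Here is a rough illustration of the idea - my own sketch with deliberately naive heuristics, not the logic of either product; real bibliographic name parsing needs to handle far more cases:

```python
# Naive illustration of building a last-name-first sort key so that
# "A. J. Knox", "Knox, A. J." and "Andrew Kurmis" all file under the surname.
# Real bibliographic data needs far more careful parsing than this.
def author_sort_key(name):
    name = name.strip()
    if "," in name:                               # already "Last, First"
        last, _, first = name.partition(",")
    else:                                         # assume "First [Middle] Last"
        parts = name.split()
        last, first = parts[-1], " ".join(parts[:-1])
    return f"{last.strip().lower()} {first.strip().lower()}"

authors = ["A. J. Knox", "Andrew Kurmis", "Frederik Anseel",
           "KF Kaltenborn", "Kenneth A Borokhovich"]
for a in sorted(authors, key=author_sort_key):
    print(a)                                      # files Anseel first, Kurmis last
```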

 

Deduplication

WebFeat offers neither sorting nor deduplication - two features which are practically a prerequisite for using the term federated search engine, and which are very important for reducing redundancy without eliminating records. No algorithmic deduplication can be perfect, but MetaLib and MetaFind both do a remarkably good job, even compared with the deduplication efforts of traditional professional information services. MetaLib's most current version, 3.12 (which was not available to me when my manuscript for the abridged print version went to press), deduplicates automatically when presenting the results, clearly marking the detected duplicates in the result list. I found its algorithm to handle duplicates with aplomb even in difficult cases where there are punctuation differences in the title, the author name, or the date format.

 

MetaFind's deduping algorithm may not be as sophisticated, but it lets users choose their preferred criteria, such as title, ISBN, ISSN, and link URL. Records which have duplicates show the number of duplicates; these are kept hidden but can be invoked to check the supposed matches. The higher the duplicate count, the more important the item may be - provided the duplicates come from databases of different content providers rather than from repurposed records in various databases of the same family.

 

MetaFind must be using rather liberal conditions for detecting duplicates, which is advantageous when one record misspells the author's name, as PsycINFO does.

 

Deduplication worked well in 90% of my tests, with only a few mistakes, such as the one shown below, which falsely identifies two records as duplicates just because part of their title information overlaps. This is where the liberal approach may backfire.
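
A common way to achieve this kind of punctuation-tolerant matching is to reduce each record to a normalized key before comparing; making the key too short (for instance, keeping only the first few title words) is exactly the kind of liberal criterion that can produce the false duplicates just mentioned. The sketch below is my own illustration of the idea, not MetaLib's or MetaFind's actual algorithm:

```python
# Rough sketch of punctuation-tolerant duplicate detection via normalized keys.
# Keying on the full normalized title plus the author's last token catches
# records that differ only in punctuation or spacing; truncating the title to
# its first few words is the "liberal" variant that can merge distinct records.
import re
from collections import defaultdict

def norm(text):
    return re.sub(r"[^a-z0-9 ]", "", text.lower())

def dedup_key(record, title_words=0):
    title = norm(record.get("title", "")).split()
    if title_words:                       # liberal variant: first N words only
        title = title[:title_words]
    author = norm(record.get("author", "")).split()
    surname = author[-1] if author else ""
    return (" ".join(title), surname)

def group_duplicates(records, title_words=0):
    """Group records by key so the duplicate count per item can be shown."""
    groups = defaultdict(list)
    for rec in records:
        groups[dedup_key(rec, title_words)].append(rec)
    return dict(groups)

a = {"title": "Journal impact factors - a critique.", "author": "J. Smith"}
b = {"title": "Journal impact factors: a critique",   "author": "J Smith"}
print(dedup_key(a) == dedup_key(b))       # True: punctuation differences ignored
```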

 

Conclusions

Of course, the license fees for entry-level and high-end federated search systems vary widely. But these fees must be put in perspective by comparing them with the license fees one pays for the digital resources, which will be discovered, selected more often, and used more effectively by patrons if a powerful federated search engine is in place.

MetaLib and MetaFind are very powerful federated search engines providing comprehensive, high-quality federated search services. There are other powerful alternatives, including the in-house adaptation of commercial metasearch software, such as MetaStar. Such in-house development puts the responsibility on the systems librarians' shoulders, but the North Carolina State University library offers an impressive example of how superbly it can be done. More about that later, in the digital preprint version of my Cheers and Jeers for 2004, coming in December.

 
