Google's Index to Scholarly Publishers' Archives - A Dark Matter

Moderator's Notes - Searching Proprietary Scholarly Content

Annual Meeting of the Society for Scholarly Publishing (June 3, 2004)


I had the privilege of moderating one of the many interesting sessions at the Annual Meeting of the Society for Scholarly Publishing (SSP), on the topic of Searching Proprietary Scholarly Content. I also had the opportunity to demonstrate briefly some findings of my research on the significant differences between searching the publishers' archives through their native search engines and through Google's special index, generated from the full-text archives fed directly to Google by nine scholarly publishers via the laudable efforts of CrossRef. (You may jump directly to the Side-By-Side program if you don't care about the demonstration.)

The idea of letting Google in on sites invisible to its spiders, so that it could also show results (at least abstracts) from scholarly publications, was a good one, but Google's implementation falls well short of expectations. You can find details of my findings in the June issue of my Digital Ready Reference Shelf, hosted by Gale. Here I just provide, for context, a link to the PowerPoint presentation I gave as moderator of the session, and illustrate the humble software tool I made to allow the side-by-side display of the results retrieved from the archives by their native search engines and by Google.

For most of the searches, Google retrieves far fewer unique records than the native engines. For the query "dark matter", the native search engine finds 108 items, while Google finds only 66.

The same is true for Blackwell's archive, where the ratio is 1321 : 694.

For the archive of the Institute of Physics (IoP), the results were close, with a ratio of 631 : 556 (but read further).

The Nature Publishing Group site showed the biggest difference, at 471 : 131.

Wiley's site is the only one where Google seemingly found more hits, with a ratio of 19 : 26.
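To make the ratios above easier to compare, the following minimal sketch expresses Google's hit count as a percentage of the native engine's count. It uses only the raw counts quoted above; the function name `google_share` and the label for the first, unnamed archive are my own illustrative choices, not part of the Side-By-Side program.

```python
# Raw hit counts for the query "dark matter", as quoted in the text:
# (native search engine hits, Google hits)
hit_counts = {
    "first example (archive not named in the text)": (108, 66),
    "Blackwell": (1321, 694),
    "Institute of Physics": (631, 556),
    "Nature Publishing Group": (471, 131),
    "Wiley": (19, 26),
}


def google_share(native_hits, google_hits):
    """Return Google's hit count as a percentage of the native count."""
    return 100.0 * google_hits / native_hits


for archive, (native_hits, google_hits) in hit_counts.items():
    share = google_share(native_hits, google_hits)
    print(f"{archive}: {native_hits} : {google_hits} ({share:.1f}% of native)")
```

These percentages are based on the reported hit counts only, so they say nothing yet about how many of Google's hits are distinct articles.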

But the Google hit numbers cannot be taken at face value, as the results may be greatly inflated. Why? Because Google counts the abstract, the PDF, and (if applicable) the HTML version of the very same article as separate hits. These duplicates, triplicates and even quadruplicates are easy to spot only when they line up like ducks in a row before crossing the road, as they do for the search in the IoP archive.
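The inflation described above can be illustrated with a small sketch that groups hits by article and compares the raw hit count with the number of distinct articles. The records below are made up for illustration, and the field names `article_id` and `format` are hypothetical; Google's result pages do not expose such an identifier, which is exactly why scattered duplicates are hard to spot.

```python
from collections import defaultdict


def count_unique_articles(hits):
    """Return (raw hit count, distinct article count, formats per article)."""
    formats_by_article = defaultdict(set)
    for hit in hits:
        formats_by_article[hit["article_id"]].add(hit["format"])
    return len(hits), len(formats_by_article), dict(formats_by_article)


# Hypothetical result list: three articles appearing in several versions.
sample_hits = [
    {"article_id": "article-A", "format": "abstract"},
    {"article_id": "article-A", "format": "pdf"},
    {"article_id": "article-A", "format": "html"},
    {"article_id": "article-B", "format": "abstract"},
    {"article_id": "article-B", "format": "pdf"},
    {"article_id": "article-C", "format": "abstract"},
]

raw, unique, by_article = count_unique_articles(sample_hits)
print(f"raw hits: {raw}, distinct articles: {unique}")
```

In this made-up list, six hits collapse to three articles, which is the kind of gap that makes a raw hit count misleading.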

The native search engine displays and counts the record only once.

When the duplicates and triplicates are scattered through the result list, they are not as obvious, as is the case in Google's search of the Blackwell archive.

Title searching can usually pinpoint the extent of duplication better, but in some archives a title-only search using Google never yields any results. This is the case with the Annual Reviews archive (where the native search engine finds 6 relevant matches), because the HTML <TITLE> field of the page is not identical to the title of the article, rendering Google's intitle option useless.
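A publisher could check for this mismatch with something like the sketch below, which extracts the <TITLE> of a page and tests whether the article title appears in it. The two sample pages and the article title are invented for illustration and do not reproduce any publisher's actual markup; the helper names `page_title` and `title_matches` are likewise hypothetical.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text found inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def page_title(html_text):
    """Return the page's <title> text with whitespace collapsed."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.title_parts).split())


def title_matches(html_text, article_title):
    """True if the article title appears (case-insensitively) in <title>."""
    def normalize(text):
        return " ".join(text.lower().split())
    return normalize(article_title) in normalize(page_title(html_text))


# Hypothetical pages: one with a generic site title, one with the article title.
generic_page = ("<html><head><TITLE>Example Journal Online</TITLE></head>"
                "<body><h1>Dark Matter in Galaxies</h1></body></html>")
titled_page = ("<html><head><TITLE>Dark Matter in Galaxies - Example Journal"
               "</TITLE></head><body></body></html>")

print(title_matches(generic_page, "Dark Matter in Galaxies"))
print(title_matches(titled_page, "Dark Matter in Galaxies"))
```

A page like the first one, whose <TITLE> carries only a site or journal name, gives a title-restricted search nothing to match against.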

The good news is that representatives of several of the interested parties were at the session and recognized the problems. My understanding is that Google will look into this matter, CrossRef will run pilots with other Web-wide search engines, and publishers or their digital facilitators will explore how the article title can be made visible to the spiders.
