|
Google's Index to Scholarly Publishers' Archives - A Dark Matter. Moderator's Notes - Searching Proprietary Scholarly Content, Annual Meeting of the Society for Scholarly Publishing (June 3, 2004) |
|
I had the privilege of moderating one of the many interesting sessions at the Annual Meeting of the Society for Scholarly Publishing (SSP), on the topic of Searching Proprietary Scholarly Content. I also had the opportunity to briefly demonstrate some findings of my research about the significant differences between searching the publishers' archives through their native search engines and through Google's special index, which is generated from the full-text archives fed directly to Google by nine scholarly publishers through the laudable efforts of CrossRef. (You may jump directly to the Side-By-Side program if you don't care about the demonstration.) |
|
|
|
While it was a good idea to let Google into sites invisible to its spiders, so that its results would also include scholarly publications (at least their abstracts), Google's implementation falls well short of expectations. You can find details of my findings in the June issue of my Digital Ready Reference Shelf, hosted by Gale. Here I just provide a link, for context, to the PowerPoint presentation I gave as moderator of the session, and illustrate the humble software tool I made to display side by side the results retrieved from the archives by their native search engines and by Google. |
|
|
|
For most of the searches Google retrieves far fewer unique records than the native engines. For the query "dark matter" the native search engine finds 108 items, while Google finds only 66. |
|
|
|
The same is true for Blackwell's archive, where the ratio is 1321 : 694. |
|
|
|
For the archive of the Institute of Physics (IoP) the results were close, with a ratio of 631 : 556 (but read further). |
|
|
|
The Nature Publishing Group site showed the biggest difference, at 471 : 131. |
|
|
|
Wiley's site is the only one where Google seemingly found more "hits", with a ratio of 19 : 26. |
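To put the ratios quoted above on a common scale, the following minimal Python sketch expresses each Google count as a percentage of the native engine's count. The figures are the ones given in the text; the first archive is not named there, so it carries a generic label, and the function and variable names are illustrative assumptions only.

```python
# Hit counts quoted in the text: (archive, native engine hits, Google hits).
# The archive behind the first pair (108 : 66) is not named in the text,
# so it is labeled generically here.
COUNTS = [
    ("(archive not named)", 108, 66),
    ("Blackwell", 1321, 694),
    ("Institute of Physics", 631, 556),
    ("Nature Publishing Group", 471, 131),
    ("Wiley", 19, 26),
]


def google_share(native_hits: int, google_hits: int) -> float:
    """Return Google's hit count as a percentage of the native count."""
    return 100.0 * google_hits / native_hits


def summarize(counts):
    """Return (archive, native, google, share rounded to 0.1) per archive."""
    return [
        (name, native, google, round(google_share(native, google), 1))
        for name, native, google in counts
    ]


if __name__ == "__main__":
    for name, native, google, share in summarize(COUNTS):
        print(f"{name:<26} {native:>5} : {google:<5} {share:>6.1f}%")
```

On these raw counts the Nature Publishing Group archive comes out lowest, at roughly 28 percent, and Wiley is the only archive above 100 percent.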
|
|
|
But the Google hit numbers cannot be taken at face value, as the results may be greatly inflated. Why? Because Google counts among the hits the duplicate and triplicate records: the abstract, the PDF, and (if applicable) the HTML versions of the very same article. These duplicates, triplicates and even quadruplicates are easy to spot only in those |
|
|
|
The native search engine displays and counts the record only once. |
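The difference between the two ways of counting can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it presumes each hit carries some article identifier (for example a DOI) and a format label, which the text does not say Google exposes, and the sample identifiers are made up.

```python
from collections import defaultdict


def count_unique_articles(hits):
    """Group hits by article identifier.

    Returns (raw_count, unique_count, formats_by_article), where raw_count
    treats every version as a separate hit and unique_count counts each
    article once, however many versions of it were retrieved.
    """
    formats_by_article = defaultdict(list)
    for hit in hits:
        # Normalize the identifier so trivial variations do not split a group.
        key = hit["article_id"].strip().lower()
        formats_by_article[key].append(hit["format"])
    return len(hits), len(formats_by_article), dict(formats_by_article)


# Made-up sample: three articles retrieved as six separate hits.
SAMPLE_HITS = [
    {"article_id": "10.9999/example.1", "format": "abstract"},
    {"article_id": "10.9999/example.1", "format": "pdf"},
    {"article_id": "10.9999/example.1", "format": "html"},
    {"article_id": "10.9999/example.2", "format": "abstract"},
    {"article_id": "10.9999/EXAMPLE.2", "format": "pdf"},
    {"article_id": "10.9999/example.3", "format": "abstract"},
]

if __name__ == "__main__":
    raw, unique, groups = count_unique_articles(SAMPLE_HITS)
    print(f"raw hits: {raw}, unique articles: {unique}")
    for article_id, formats in groups.items():
        print(f"  {article_id}: {', '.join(formats)}")
```

Counted the first way the sample yields six hits; counted once per article, as the native engines do, it yields three.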
|
|
|
When the duplicates and triplicates are scattered in the result list, they are not as obvious as is the case in Google's search of the Blackwell archive. |
|
|
|
Title searching can usually better pinpoint the extent of duplication, but in some archives a title-only search using Google never yields any result, as is the case with the Annual Reviews archive (where the native search engine finds 6 relevant matches), because ... |
|
|
|
... the HTML <TITLE> field of the page is not identical to the title of the article, rendering the intitle option of Google useless. |
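Publishers could check for this mismatch themselves with something like the following Python sketch, which uses the standard library's `html.parser` to pull out a page's `<TITLE>` text and test whether every query word occurs in it. This only roughly approximates a title-restricted search; how Google actually matches `intitle` queries is not described here, and both sample pages and their titles are invented for illustration.

```python
import re
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text found inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def page_title(html_text: str) -> str:
    """Return the <title> text with whitespace collapsed."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())


def title_contains_all_words(html_text: str, query: str) -> bool:
    """True if every word of the query occurs in the page's <title> text."""
    title_words = set(re.findall(r"\w+", page_title(html_text).lower()))
    return all(w in title_words for w in re.findall(r"\w+", query.lower()))


# Invented pages: one with a generic <TITLE>, one carrying the article title.
GENERIC_PAGE = "<HTML><HEAD><TITLE>Journal Home Page</TITLE></HEAD><BODY><H1>Dark matter in galaxies</H1></BODY></HTML>"
ARTICLE_PAGE = "<HTML><HEAD><TITLE>Dark matter in galaxies</TITLE></HEAD><BODY></BODY></HTML>"

if __name__ == "__main__":
    for label, page in (("generic", GENERIC_PAGE), ("article", ARTICLE_PAGE)):
        print(label, repr(page_title(page)), title_contains_all_words(page, "dark matter"))
```

In the first sample page the article title appears only in the body, so a title-restricted match on "dark matter" fails even though the page is clearly about the article.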
|
|
|
The good news is that representatives of several of the interested parties were at the session and recognized the problems. My understanding is that Google will look into this matter, CrossRef will run pilots with other Web-wide search engines, and publishers or their digital facilitators will explore how the article title can be made visible to the spiders.