Peter Jacso: Google Scholar and The Scientist

This is a background piece for the interview I gave to Jeff Perkel for his article in The Scientist. Considering the space limitations of the print edition, it is understandable that only a small part of my argument could be included. I provide here some background illustrations and comments on my correctly quoted remark that Google Scholar (GS) does a really horrible job matching cited and citing references.

 

The interview started with an innocent question about my opinion of the article Citation Counts published in D-Lib Magazine. I said that the comparison of the citedness scores for a single year of the Journal of the American Society for Information Science (JASIS), which showed that on average GS detects 4.5 more citing references than Web of Science (WoS), should not serve as proof of GS's superiority. The test results for the other sample year showed that WoS had on average 8.7 more citing items than GS for the selected 1985 JASIS articles. As it is only news when the postman bites the dog, this result did not get the same attention as the other sample.

 

It should have been a warning sign. Google plays fast and loose in reporting its hits, and so do its competitors. In the scholarly world this may not fare so well after the honeymoon period with GS is over, and serious users start taking a closer look at a) the hits which appear in the result list, b) the reported citedness scores, and c) the items purportedly citing the ones in the result list.

I knew that I had to use some tailor-made examples to get my message through, so I searched GS for articles in The Scientist. GS claims that the most cited article in The Scientist has a citedness score of 7,380. Poor WoS could come up with only a very few articles from The Scientist that were cited more than 20 times.

 

For the one by Kraulis in The Scientist, WoS found only 2 citing items, and rightly so.

 

 

It is a tiny, half-page interview (hardly citable material), made on the occasion of ISI spotting his scholarly paper of 1991 as a “hot paper” which picked up citations at a very fast pace soon after its publication.

 

It is his paper in the Journal of Applied Crystallography for which WoS correctly identifies 11,619 citing items, and can even list each and every one of them. GS stops showing the marbles it claims to have after the first 1,000.

 

And how many citing references does GS find for the really much cited Kraulis article published two years earlier in the Journal of Applied Crystallography? 231, that is, two hundred and thirty-one. Well, actually a little more than that, as there are almost two dozen other citations scraped from the Web by GS with various misspellings, which is certainly useful for inflating the size of its result list, but not for much else.

 

The typical user searching by the name of the famous MOLSCRIPT program would not easily find the source record for the article. They would get 7,220 hits, but the record for the real article is at the very bottom of that long list. Unfortunately, the record for the real article has no citedness score. It is the one that would show the full text of the article. For the sake of brevity, I show on the next slide the short result list of a highly specific, multi-word query with the author name for the article, not the typical user's query.

 

Why is the correct source record at the very bottom of the search result list? Because GS mistakes the journal name for the article title, which is not conducive to accurate citation matching, nor is it very good for GS's relevance ranking. Not that most users would go beyond the top 10 hits, especially by wading through the many skeletal [CITATION] records.
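
To make the mechanics concrete, here is a minimal, hypothetical Python sketch; the records are made up for illustration, and this is of course not GS's actual code. It only shows how a scraped record whose title field holds the journal name fails both a simple title-word query and a naive title-plus-year citation key:

def title_query_hits(query_word, record):
    # Does a naive title search find this record?
    return query_word.lower() in record["title"].lower()

def citation_key(record):
    # Naive key for matching cited references to source records.
    return (record["title"].lower(), record["year"])

# Illustrative records only; the field values are made up for this example.
correct_record = {
    "title": "MOLSCRIPT: a program to produce both detailed and schematic "
             "plots of protein structures",
    "journal": "Journal of Applied Crystallography",
    "year": 1991,
}
garbled_record = {
    "title": "Journal of Applied Crystallography",  # the journal name landed in the title slot
    "journal": "",
    "year": 1991,
}

print(title_query_hits("MOLSCRIPT", correct_record))                  # True
print(title_query_hits("MOLSCRIPT", garbled_record))                  # False: the record sinks in the ranking
print(citation_key(correct_record) == citation_key(garbled_record))   # False: citations no longer match

A record garbled this way ranks poorly for the very words that should retrieve it, and any citation compared against the mislabeled key is lost or credited to a phantom record.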

 

The first purportedly citing item is from the prestigious Nucleic Acids Research. Unfortunately, when I asked GS to show me the money, i.e., the citing record, it could not find its location.

 

 

I knew that this journal is open access, so I would be able to corroborate the citedness claim by looking at the article's cited reference list.

 

 

 

None of its 17 references cites the Kraulis article, or anything else by the author.

 

I had no luck with the second article purportedly citing the Kraulis article, either.

 

The cached version of the second item brought up an article which cited two articles by Kraulis, but not the article in The Scientist to which GS credited the citation.

 

The third purportedly citing article produced an Object Not Found error message, and the 9 additional (very redundant) links offered no full text to corroborate the citation. There may be one or two papers in GS which cite the short interview with Kraulis. WoS identified two of them, but the citedness score of 7,380 awarded by GS to the tiny Kraulis item published in The Scientist is just pure nonsense.

 

As for the citedness scores in general, they are often as inflated as the hit counts. Many of the purportedly citing articles do not cite the source items as GS claims. (I discuss this in an upcoming paper in the special issue of Current Science which pays tribute to Eugene Garfield on the 50th anniversary of the publication of his seminal article in Science.) Suffice it to repeat here what The Scientist quoted from me: GS often cannot tell a page number from a publication year, or part of a book title from a journal name, and it dumps absurd data on you, such as the record of an article which GS happily serves up when one looks for upcoming articles on semiconductors to be published in 2006 (possibly already available in the publisher's archive, to which GS has a free pass).

 

One or two citations to articles in press are possible (as is the case with a special tsunami issue of Natural Disasters, where authors apparently knew about their fellow contributors' article). But a citedness score of 98 for an in-press article in October 2005 is too good to be true.

Actually, it was published 15 years ago, in 1990; 2006 is the article's starting page number, not the publication year. Mind you, GS has access to the neat and clean metadata of millions of articles, courtesy of the grateful publishers, with the data elements labeled the way foods are labeled in the fridge of a senior citizen home, but it does not help.
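
For illustration only, here is a tiny, hypothetical Python sketch of the kind of naive year extraction that can produce such blunders; the reference string is made up, and this is of course not GS's actual parser:

import re

def naive_year(reference):
    # Naive rule: treat the first four-digit number that looks like a year
    # as the publication year.
    match = re.search(r"\b(19|20)\d{2}\b", reference)
    return int(match.group()) if match else None

# Made-up reference string: the article starts on page 2006 but appeared in 1990.
ref = "A. Author, Some Journal, 12, 2006-2015 (1990)"
print(naive_year(ref))   # 2006 -- the starting page number, mistaken for the year

A rule like this, applied to scraped reference strings instead of the labeled metadata, is enough to turn a 1990 article into an "upcoming" 2006 one.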

 

Undoubtedly, there are good results matching source and citing items in GS, but the matching algorithm of its autonomous citation indexing process is a far cry from the very intelligent and sophisticated algorithm of CiteSeer, which gave the idea for GS.
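
To show what even modest care in matching buys, here is a minimal, hypothetical Python sketch of normalized, fuzzy reference matching. It is emphatically not CiteSeer's (or anyone's) actual algorithm, just an illustration of the idea of tolerating formatting variants when comparing reference strings:

import re

def normalize(ref):
    # Lowercase, drop punctuation, collapse whitespace.
    ref = ref.lower()
    ref = re.sub(r"[^a-z0-9 ]", " ", ref)
    return re.sub(r"\s+", " ", ref).strip()

def same_work(ref_a, ref_b, threshold=0.7):
    # Crude token-overlap (Jaccard) test for whether two reference strings
    # plausibly point to the same work.
    a, b = set(normalize(ref_a).split()), set(normalize(ref_b).split())
    return len(a & b) / len(a | b) >= threshold

ref1 = "Kraulis, P.J. (1991) J. Appl. Cryst. 24, 946-950"
ref2 = "Kraulis P J, J Appl Cryst, 24:946, 1991"
print(same_work(ref1, ref2))   # True -- survives punctuation and formatting differences

A matcher that normalizes and compares tokens in this spirit can absorb abbreviation and ordering differences; one that compares raw strings, or mislabeled fields, produces the phantom and missing matches illustrated above.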

 

Not even that would help enough to counter the shallowness of GS for this journal (and many others), as illustrated by the results of author name searches, which also indicate, for comparison, the verifiable hits in WoS.

 

Simple topical test searches show similarly extreme shallowness in GS vis-à-vis WoS.

 

For the specific single volume of JASIS used in the D-Lib article, the source items were almost identical in GS and WoS, and the total citedness score reported by the former is indeed higher. But in the case of GS one cannot take these scores at face value because of its very poor matching algorithm, which often produces phantom matches. The coverage of JASIS and JASIS&T from 1970 to the end of 2005 sheds some light on additional underlying problems. GS depends heavily on the largest publishers for source items. John Wiley has a digital version of JASIS only from 1986 onward, making WoS dwarf the hit counts of GS.

GS's coverage of JASIS between 1970 and 1985 is abysmal, consisting mostly of scraped [CITATION] records. Comparing the citedness scores of articles for any of those years would turn out far worse for GS than the comparison for the 1985 issues did.

 

GS is not unlike the nice guy at the supermarket who kindly greets me with a “Hi Paul, how are you and your sister?” I tell him that I am Peter and have no sister; still, he bids me goodbye with “Bye, Chuck, and best regards to both of your sisters.” I am happy that he does not work for the INS, FEMA or the Dept. of Homeland Security. Those guys hopefully don't rely on GS for scholarly information. You should not either, so don't cancel your WoS and/or Scopus subscriptions yet just because shallow and less than competent journalists and scientists claim GS to be equivalent to them based on the numbers reported by GS, without understanding, let alone spot-checking, those hit and citation counts.

By the way, the most cited item in The Scientist according to WoS is an article by Eugene Garfield entitled “Citation Data Is Subtle Stuff”. He is right, of course. He is a scholar. He is a scientist; he is the information scientist, and the founder of The Scientist. He has discovered a thing or two about citation data, citation indexing, and citation analysis over the past 60 years or so, as he started studying citationology in his early 20s, and not only in 20% of his free time. There is something to learn from him.
