Survey on Spell Checking Xin Chen. Created on 6/25/07, Last modified on 6/25/3007. 1. The theory of spell checking. The task of spell checking is to find out misspelled words in a text, and sometimes also to give a list of correctly spelled replacements. It has been under research for some time. The earliest spell checking program was written in the 1970s. A spell checking facility can be a stand-alone application, or integrated with another text processing application. As explained, the two tasks involved in a spell checker are given below along with possible solution. 1) Find match in a vocabulary. The vocabulary is ususally a dictionary. Such a dictionary can be static when the vocabulary does not change or dynamic otherwise. The dictionary should support efficient search for an item. When the dictionary is dynamic, it should also allow efficient insertion/deletion operations. Data structures that allow efficient search/insertion/deletion include balanced binary tree, B-tree, Hash table and bloom filter. For a static dictionary where only efficient search is needed, a sorted list using binary search can be a good alternative for its ease of coding. Some spell checkers don't use a dictionary but depends on rules extracted from a large body of text. Such mechanisms usually involve scanning a large amount of text for trigrams (consequtive 3 characters strings). Those that happen rarely are labeled as anomalies and thus possible spelling mistakes. The advantage of this is that one can do spelling checking when a dictionary is not available, but then a large body of correct text is needed. The dictionary approach is generally the choice. 2) Suggest correct alternatives. For users that are not good at spelling words, the extra feature of suggesting correct alternatives is very helpful. The question is how to build such a list of alternatives. Obviously these are words close to the misspelled word. Words can be similar to each other in morphology or in pronounciation. 2.1) Morphologically, it is found that most spelling mistakes are 1-letter substitution/insertion/deletion and transposition of 2 adjacent letters. So by flipping the characters in a misspelled word and search for the resulted word in the dictionary, we can find those alternatives. 2.2) Phonetically, it is possible to assign characters into groups (soundex group) based on the similarity of pronounciation (soundex code). Then the misspelled word is converted into its soundex code, and words with similar phonetic features are extracted. In summary, similar words can be found using both string similarity algorithms (dynamic programming techniques like the Levenshtein algorithm) and phonetic matching algorithms (like soundex algorithms or metaphone algorithms). When giving a list of alternatives, there is a question of how to order them. Obviously those words closer to the misspelled word should be listed first, followed by those less likely ones. This is where the "distance" measure comes into play. "Levenshtein distance" is one used by the dynamic programming string matching algorithm. 3) Context-sensitive checkers. The problem of real-word error should be noted. This happens when a spelling can be found in the dictionary but it is not the correct word to use at the location, thus a false-negative is ignored. For example, the sentence "It as sunny today.", the word "as" should be "is". This situation also can happen when the dictionary is too big and contains rarely used words. For the language of English, it is found that this happens when the dictionary used contains more than 90,000 words. Problems like this are related to the context where the word is used. Thus algorithms that analyze the grammatical structure of the text is needed. This is closely related to natural language understanding and processing, and is a hard problem not generally well solved today. Some experimental systems do exist from some large companies like IBM and academic institutes. [1] and [2] are simple but comprehensive introdction to spell checking. [3] gives a good theoretical survey. 2. Implementations of spell checking. Proprietary implementations exist widely. One example is Microsoft office. In open souce implementation, the most famous is GNU's Aspell project. The Java's Jazzy project is based on the Aspell project. The problem using Aspell or Jazzy is that they use LGPL license, so any work derived from them should use LGPL license too and be open souce (although that does not prevent one from collecting payment on distributing the derived work). Many other implementations exist for .NET, java, php, C++ etc. Some are free, some are proprietary. 3. Implement the Hawaiian spell checker. The Hawaiian language was an oral languages initially. Not until it was found by the western world a missionary created the written form for Hawaiian using 12 letters. The vocabulary is expected to be several thousands (?). There are efforts to include modern concepts into the Hawaiian languages. In that sense, it is a dynamically growing languages. The spell checker can be stand alone or integrated. It can be a windows application or a web application. In both cases, it could be implemented in C/C++ as a component (a COM object, or a .DLL file). Then it is closed source and can be easily distributed and used with a clear interface. It could also be implemented directly into ASP.NET in VB.NET or C#.NET, then the source may be viewable. The performance if implemented in .NET is unknown at this time. Although with a relatively small vocabulary, it may not be a problem. If efficiency is a problem, then a implementation in C/C++ with appropriate data structure is needed. But actually I personally think a C/C++ implementation is better. An experimentation can be started following this procedure: 1) Create a dictionary. The dictionary should contain a static part containing the central part of the language, and a dynamic part that corresponds to the added words and those customly allowed by the user. These two can be combined, but separate the two may simplify the task. 2) Implement efficient searching in the dictionary. A fast implementation in sorted list can be used here. Later if insertion/deletion are to be considered, this part can be refactored. 3) Implement string matching algorithm. 4) Implement phonetic matching algorithm. 5) Incoporate it into the environment (web/windows etc). The GNU's Aspell package may be a good source of checking how others do the work. But the LGPL license restriction should be discussed. References and appendix: A. Environment: 1) windows application. C/C++, Java, .NET. GUI. 2) web application. Implemented as a COM component, or directly in ASP.NET, or as database query (may be too slow). 3) study Aspell and adapt it. Aspell has LGPL license though and requires open source. B. Theory: - [1] Wiki http://en.wikipedia.org/wiki/Spell_checker - [2] What is Spell Checker? (similar to Wiki) http://www.compassrose.com/publishing/about-spell-checker.html - [3] Spellchecking by computer. the Simplified Spelling Society, Vol 20, No 1, 1996, pp 4-11. http://www.dcs.bbk.ac.uk/~roger/spellchecking.html Error checking: - two ways of check without dictionary (trigram extraction) Most do use dictionary though. - save space (use rules, hash) - real-word error (false-positive v.s. false-negative) - errors related to grammar and context Error correction: - decide what alternative word to use. - simple errors (substitution/insertion/deletion/transpose) - soundex (soundex code and soundex group) - the SPEEDCOP system - string-matching - Feature vectors (probability) - Phonetic similarity (pronounciation, homophone) - ordering the alternative word list - use semantics (much more difficult, need gramartical structure) - [4] SPEECH and LANGUAGE PROCESSING. (Book) http://www.cs.colorado.edu/~martin/slp.html - [5] Bloom filter http://en.wikipedia.org/wiki/Bloom_filter a space-efficient probabilistic data structure used to test whether an element is a member of a set. Applicable to spell checking. - [6] How to Write a Spelling Corrector - Bayes theorem application. http://norvig.com/spell-correct.html C. Implementations: - GNU Aspell. In C++. http://aspell.net/ Internals of Aspell: http://mail.python.org/pipermail/python-list/2002-October/168982.html Manual: http://aspell.net/man-html/index.html - Can't beat Jazzy. - Java's spell checker API. http://www.ibm.com/developerworks/java/library/j-jazzy/ - is based upon Aspell algorithm. a) Phonetic matching algorithms - soundex algorithms - Metaphone algorithm b) String matching algorithms - Dynamic programming (Levenshtein algorithm) - Simplified version of Google's spell corrector in Python. http://www.norvig.com/spell-correct.html http://googlesystem.blogspot.com/2007/04/simplified-version-of-googles-spell.html - use of substitution/insertion/deletion/transpose plus bayes model. - A spell checking engine in C++ http://www.codeproject.com/cpp/spellchecker_mg.asp Performance not good enough. Restricted to English. - VB.NET spell check: http://www.dart.com/rsweb.aspx Used in ASP.NET application. $249 - ASP.NET spell checker control: http://www.telerik.com/products/aspnet/controls/spell/overview.aspx Used in ASP.NET application. $199 - Suggester - Spell Checking Java library http://softcorporation.com/products/spellcheck/ Free. - More in JSP, PHP etc. D. Technology for writing COM object. - COM enabled environments: VB, VC++, .NET, ASP. - Writing a COM object with VB6.0: http://www.4guysfromrolla.com/webtech/040300-1.shtml - COM wrappers for C++ classes: http://groups.google.com/group/microsoft.public.vc.language/browse_thread/thread/7b25f813685bd6f1/d72c8711870fe006?lnk=st&q=vc%2B%2B+create+COM+object&rnum=8&hl=en#d72c8711870fe006 - Creating a basic COM class http://edndoc.esri.com/arcobjects/9.0/ExtendingArcObjects/Ch01/BasicClass.htm - Creating a Component using Visual C++ to Manipulate Virtual Directories http://www.15seconds.com/issue/990107.htm - More by search "create COM object in C++". - This is wiki's introduction to COM, with tech articles on how are create COM object in C/C++/.NET: http://en.wikipedia.org/wiki/Component_Object_Model