Levels of tasks: 1) find match in a vocabulary. Data structure for the vocabulary: sorted list, balanced Binary tree, B-tree, Hash table, bloom filter. 2) suggest correct alternatives. Concepts: levenshtein distance (DP), morphology, statistics (n-gram) Two groups of algorithms: a) String similarity algorithms b) Phonetic matching algorithms. 3) Context-sensitive checkers. Environment: 1) windows application. C/C++, Java, .NET. GUI. 2) web application. Implemented as a COM component, or directly in ASP.NET, or as database query (may be too slow). 3) study Aspell and adapt it. Aspell has LGPL license though and requires open source. Theory: - [1] Wiki http://en.wikipedia.org/wiki/Spell_checker - [2] What is Spell Checker? (similar to Wiki) http://www.compassrose.com/publishing/about-spell-checker.html - [3] Spellchecking by computer. the Simplified Spelling Society, Vol 20, No 1, 1996, pp 4-11. http://www.dcs.bbk.ac.uk/~roger/spellchecking.html Error checking: - two ways of check without dictionary (trigram extraction) Most do use dictionary though. - save space (use rules, hash) - real-word error (false-positive v.s. false-negative) - errors related to grammar and context Error correction: - decide what alternative word to use. - simple errors (substitution/insertion/deletion/transpose) - soundex (soundex code and soundex group) - the SPEEDCOP system - string-matching - Feature vectors (probability) - Phonetic similarity (pronounciation, homophone) - ordering the alternative word list - use semantics (much more difficult, need gramartical structure) - [4] SPEECH and LANGUAGE PROCESSING. (Book) http://www.cs.colorado.edu/~martin/slp.html - [5] Bloom filter http://en.wikipedia.org/wiki/Bloom_filter a space-efficient probabilistic data structure used to test whether an element is a member of a set. Applicable to spell checking. Implementations: - GNU Aspell. In C++. http://aspell.net/ Internals of Aspell: http://mail.python.org/pipermail/python-list/2002-October/168982.html Manual: http://aspell.net/man-html/index.html - Can't beat Jazzy. - Java's spell checker API. http://www.ibm.com/developerworks/java/library/j-jazzy/ - is based upon Aspell algorithm. a) Phonetic matching algorithms - soundex algorithms - Metaphone algorithm b) String matching algorithms - Dynamic programming (Levenshtein algorithm) - Simplified version of Google's spell corrector in Python. http://www.norvig.com/spell-correct.html http://googlesystem.blogspot.com/2007/04/simplified-version-of-googles-spell.html - use of substitution/insertion/deletion/transpose plus bayes model. - A spell checking engine in C++ http://www.codeproject.com/cpp/spellchecker_mg.asp Performance not good enough. Restricted to English. - VB.NET spell check: http://www.dart.com/rsweb.aspx Used in ASP.NET application. $249 - ASP.NET spell checker control: http://www.telerik.com/products/aspnet/controls/spell/overview.aspx Used in ASP.NET application. $199 - Suggester - Spell Checking Java library http://softcorporation.com/products/spellcheck/ Free. - More in JSP, PHP etc.