Hawaiian Spell Checker Design Created on: 7/5/2007, Last modified: 7/6/2007 Table of contents 1. Implementation phase 2. Workflow 3. Dictionary design 4. Morphological string matching algorithm design 5. Phonetical string matching algorithm design 6. Gramatical analysis algorithm design 7. Environment wrapper design 8. Consideration on the Hawaiian language itself References 1. Implementation phase 1) Create a dictionary. The dictionary should contain a static part containing the central part of the language, and a dynamic part that corresponds to the added words and those customly allowed by the user. These two can be combined, but separate the two may simplify the task. 2) Implement efficient searching in the dictionary. A fast implementation in sorted list can be used here. Later if insertion/deletion are to be considered, this part can be refactored. 3) Implement string matching algorithm. 4) Implement phonetic matching algorithm. 5) Incoporate it into the environment (web/windows etc). 6) Implement gramatical check if available. 2. Workflow First need to break the input text into words or phrases. Then for each of these, do the following: WORD --> search --> found (finish) | | not found V get morphological/phonetical similarities | | V search, and list hits by rank. (finish) Note: here addresses are the terms to search. Thus whether to break down into each individual word or into an address phrase needs consideration. 3. Dictionary design The dictionary will not be a static one, because from time to time new words, or special words (acronyms etc.) that the user needs will be inserted. Hawaiian by itself is expanding its vocabulary. But it is also true that the address space is fixed. There will be insertion/deletion, but not frequent. The dictionary can use a static part and a dynamic, customized part. But that just complicates the situation. So I prefer a single part. The dictionary should support efficient search/insert/delete. A dictionary can be implemented as a balanced binary tree (AVL, red-black, or splay), or a hash table, or a B-tree for large data. In this case of Hawaiian, it's not a big vocabulary, but find a good hashing function may not be easy. So a balanced binary search tree is used. The search function comes with the data structure. The dictionary will be read into memory when the program starts, then new words may be inserted into the tree, finally, it is written back to disk when the program exits. 4. Morphological string matching algorithm design 1-letter insertion/deletion/substitution and 2-letter exchange are the most common cases of mis-spelling. Other variations are possible but need much more computation. The rank of a similar word can be calculated using the dynamic programming algorithm (Levenshtein algorithm). 5. Phonetical string matching algorithm design Although the Hawaiian language pronounciation is unlike that of English, it is highly regular. With the help of Hawaiian specialists, a soundex grouping of Hawaiian can be constructed. Each word will have a soundex encoding. The phonetical string matching algorithm then find all those words with the same and similar soundex encoding words. The rank can be calculated by the Levnshtein algorithm. 6. Gramatical analysis algorithm design Some grammatical check can be added later to increase the accuracy. 7. Environment wrapper design First the implementation is done is C/C++. Two central files are needed: the engine executable file and the dictionary data file. The engine will be written as a COM object, that can be used in any COM enabled environment (VB, VC, .NET, ASP). One caution is that we cannot read in the dictionary for each word processed. We need the program to stay in the memory like a server. For the reason, the COM object can be implemented as a windows process that loads the dictionary when started, and stay there to process incoming request. At this time I don't know if the COM object will be implemented as a server directly, or can be wrapped in a .NET windows process dll file. 8. Consideration on the Hawaiian language itself References See the references of the survey.