ICS 661, Advanced Artificial Intelligence

Assignment 2

Download the hunpos open source HMM tagger and train it on an 80% portion of the CRATER million-word multilingual annotated corpus. I recommend the English files (english000.html and english001.html), but if you are semi-fluent in French or Spanish, you are welcome to try those instead. hunpos expects the training corpus to have one word or punctuation symbol per line, followed by a tab and then its part-of-speech (see the UserManual under the Wiki on the hunpos site), with an empty line marking the end of every sentence. You will have to transform and clean the CRATER data to make it suitable as input to hunpos. In the CRATER files, the end of a sentence is marked by a </p> end-of-paragraph HTML tag.
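As an illustration of the target format described above, here is a minimal Python sketch that writes already-parsed sentences as one word, a tab, and a tag per line, with an empty line after each sentence. The function name to_hunpos_lines and the tags in the example (DT, NN, and so on) are only placeholders I made up for illustration; use whatever tags actually appear in the CRATER files.

```python
def to_hunpos_lines(sentences):
    """Turn tagged sentences into hunpos-style training lines.

    `sentences` is a list of sentences; each sentence is a list of
    (word, tag) pairs. Returns a list of strings without newlines.
    """
    lines = []
    for sentence in sentences:
        for word, tag in sentence:
            lines.append(f"{word}\t{tag}")  # word, tab, part-of-speech
        lines.append("")  # empty line marks the end of the sentence
    return lines


# Placeholder sentences; the tags are illustrative, not taken from CRATER.
example = [
    [("The", "DT"), ("modem", "NN"), (".", ".")],
    [("It", "PRP"), ("works", "VBZ"), (".", ".")],
]
hunpos_text = "\n".join(to_hunpos_lines(example)) + "\n"
```

Writing hunpos_text to a file gives a small corpus in the expected layout, which you can use to check your hunpos installation before cleaning the full data.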

We will do a 5-fold cross validation, where we split the total data into 5 approximately equal portions and use 4/5 for training and 1/5 for testing. We will repeat this 5 times, each time with a different 1/5 portion used for testing. What is the error rate for each case? What is the average error rate and the standard deviation in error rates?
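One possible way to organize the split and the summary statistics is sketched below in Python. It assumes the cleaned corpus is already in memory as a list of sentences and cuts it into 5 contiguous portions; the names k_fold_splits and summarize are my own, not part of hunpos. Note that statistics.stdev is the sample standard deviation (dividing by n - 1); if you prefer the population version, use statistics.pstdev and say which one you used in your report.

```python
import statistics


def k_fold_splits(sentences, k=5):
    """Yield (train, test) pairs; each portion is the test set exactly once."""
    n = len(sentences)
    # Integer boundaries give k contiguous, approximately equal portions.
    bounds = [i * n // k for i in range(k + 1)]
    for i in range(k):
        test = sentences[bounds[i]:bounds[i + 1]]
        train = sentences[:bounds[i]] + sentences[bounds[i + 1]:]
        yield train, test


def summarize(error_rates):
    """Return (mean, sample standard deviation) of the per-fold error rates."""
    return statistics.mean(error_rates), statistics.stdev(error_rates)
```

Splitting by whole sentences rather than by lines keeps every sentence intact in either the training or the test portion.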

The default training for hunpos is for trigrams. Switch to bigrams (-t 1) and repeat the 5-fold cross validation, reporting all 5 error rates and the average and standard deviation of the error rates. Which works better for the CRATER data, bigrams or trigrams? Why do you think this is the case?

To transform the CRATER files, take only the lines in bold (search for ^<b>.*</b>$), which are the raw words, and the lines beginning with <TT> and ending with </TT>, which are the p-o-s. Note that some raw words have been combined into a single lemma in the CRATER data. For example, Ccitt-Defined appears as 3 bold words, Ccitt - Defined, but as a single italic lemma with only one p-o-s, JJ. You will have to add an appropriate p-o-s for each of these words. For example, you could use the hyphen itself (-) as the p-o-s for the - and JJ for Ccitt. Don't worry about the absolute correctness of your p-o-s labeling. Being consistent is more important.
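If you would rather script the extraction than do it in an editor, the following Python sketch shows one way to do it. It rests on several assumptions you should check against the actual files: each bold word and each <TT> tag sits on its own line, the <TT> line comes after the bold word or words it belongs to, and </p> closes a sentence. The function name parse_crater and the policy for multi-word lemmas (every word inherits the lemma's p-o-s, except that a word with no letters or digits is tagged with itself) are my own choices for illustration; any consistent policy is fine.

```python
import re

BOLD = re.compile(r"^<b>(.*)</b>$")      # raw word lines
TT = re.compile(r"^<TT>(.*)</TT>$")      # p-o-s lines


def parse_crater(lines):
    """Return a list of sentences, each a list of (word, tag) pairs."""
    sentences, sentence, pending = [], [], []
    for raw in lines:
        line = raw.strip()
        m = BOLD.match(line)
        if m:
            pending.append(m.group(1))  # wait for the tag that follows
            continue
        m = TT.match(line)
        if m:
            tag = m.group(1)
            multi = len(pending) > 1    # several words share one lemma
            for word in pending:
                is_punct = not any(ch.isalnum() for ch in word)
                # Assumed policy: punctuation inside a multi-word lemma
                # is tagged with itself; everything else gets the tag.
                sentence.append((word, word if multi and is_punct else tag))
            pending = []
            continue
        if "</p>" in line and sentence:
            sentences.append(sentence)  # </p> ends the sentence
            sentence = []
    if sentence:
        sentences.append(sentence)
    return sentences
```

With this policy, the Ccitt - Defined example comes out as Ccitt tagged JJ, - tagged -, and Defined tagged JJ. Bold words that are never followed by a <TT> line are silently dropped, so spot-check the output against the original files.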

To transform and clean the CRATER data, I recommend grep on Linux/Mac along with a text editor that supports regular expressions, such as emacs or vi. On Windows, Microsoft Word supports regular expressions (but is not free). There are also many free downloadable text editors for Windows that handle regular expressions. You can also download Cygwin or MinGW to get UNIX tools like grep on Windows. To compare the original hand-tagged test sets with the hunpos-tagged test sets, I recommend diff with wc on Linux/Mac and WinMerge on Windows (the number of differences is shown at the bottom right).
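As an alternative to counting diff lines, the comparison can also be sketched in a few lines of Python. This assumes both files use the word, tab, p-o-s layout with the same words in the same order; the function name error_rate is my own. It strips surrounding whitespace from each line so that a stray trailing tab or space is not counted as a difference, and it stops with an error if the two files do not line up.

```python
def error_rate(gold_lines, tagged_lines):
    """Fraction of words whose tag differs between the two files."""
    def read(lines):
        pairs = []
        for line in lines:
            fields = line.strip().split("\t")
            if fields == [""]:
                continue  # skip the empty end-of-sentence lines
            if len(fields) < 2:
                raise ValueError(f"line without a tag: {line!r}")
            pairs.append((fields[0], fields[1]))
        return pairs

    gold, pred = read(gold_lines), read(tagged_lines)
    if not gold or len(gold) != len(pred):
        raise ValueError("files are empty or have different word counts")
    errors = 0
    for (g_word, g_tag), (p_word, p_tag) in zip(gold, pred):
        if g_word != p_word:
            raise ValueError(f"words do not align: {g_word!r} vs {p_word!r}")
        if g_tag != p_tag:
            errors += 1
    return errors / len(gold)
```

Both arguments can be open file objects or lists of lines, for example the contents of english1test.corpus and english1tagged.corpus.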

Name your training data files english1.corpus, english2.corpus, ..., english5.corpus, where each file contains a different 4/5 of the combined and cleaned english000.html and english001.html files (or french1.corpus or spanish1.corpus if you are using those). Name your trained bigram models english1bi.model, english2bi.model, etc. and your trained trigram models english1tri.model, etc. Name your test sets english1test.corpus, etc. and the results of tagging by hunpos english1tagged.corpus, etc. Zip all of these files together with your report (PDF, ASCII text, or MS Word .doc or .docx are all fine) and submit the zip file on Laulima.

For more information about how hunpos works, read this paper from the Proceedings of the ACL 2007 Demo and Poster Sessions.

David N. Chin / Chin@Hawaii.Edu