Ling 431/631: Corpus Linguistics
Ben Bergen
Meeting 13: Encoding schemes and part-of-speech tagging
November 26, 2007
General
Remaining questions on the Python labs?
Other linguistics-specific Python resources
Rob Malouf's class on computational corpus linguistics [http://www-rohan.sdsu.edu/~malouf/ling571.html#lectures]
Natural Language Toolkit [http://nltk.sourceforge.net/index.php/Main_Page] includes the following software modules (over 50k lines of Python code):
Corpus readers: interfaces to many corpora
Tokenizers: whitespace, newline, blankline, word, wordpunct, treebank, sexpr, regexp, Punkt sentence segmenter
Taggers: regexp, n-gram, backoff, Brill, HMM
Parsers: recursive descent, shift-reduce, chunk, chart, feature-based, probabilistic
Semantic interpretation: untyped lambda calculus, first-order models, parser interface
Miscellaneous: feature detection, unification, chatbots, many utilities
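To get a feel for why there are so many tokenizers in that list, here is a small standard-library sketch [not NLTK's own code] contrasting whitespace tokenization with a "wordpunct"-style split. The regular expression is, to my understanding, the one NLTK documents for its wordpunct tokenizer; the function names and the sample sentence are made up for illustration.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation).
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays glued to words."""
    return text.split()

def wordpunct_tokenize(text):
    """Split into alphanumeric runs and punctuation runs."""
    return WORDPUNCT.findall(text)

sample = "Don't stop, please."
print(whitespace_tokenize(sample))  # ["Don't", 'stop,', 'please.']
print(wordpunct_tokenize(sample))   # ['Don', "'", 't', 'stop', ',', 'please', '.']
```

Neither result is "right": which one you want depends on whether your corpus query should treat "stop," and "stop" as the same wordform.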
Use it [and comment it] or lose it!
Encoding
There are many different character encoding standards. If you're working with a writing system close to that of English [Roman characters], one of the single-byte standards, like ISO Latin-1, will work [that's usually the default]. But for other systems, like IPA or non-Roman scripts, Unicode is the way to go.
Unicode [its character set is kept synchronized with ISO 10646]
A "universal" character encoding standard that assigns a single code to each character [e.g. U+00FC, which is 'ü']. Note that these codes define characters, not glyphs. You also need a Unicode font [freely downloadable] and a program [like a browser or a word processing program] to convert the character representations into squiggles.
Has defined regions for different scripts, and lots of room to define your own for local use
Can be produced in recent versions of Word, OpenOffice, Netscape Composer, and others
Can be searched in TextSTAT
You use Unicode just like you would any other standard. For instance, in TextSTAT, if a file you add to your corpus is in Unicode, you can search for characters in it by typing [or pasting] them into the pattern definition. In Word, certain fonts, like Arial Unicode MS, support Unicode, and you can use them to create texts in Unicode for your corpus if you like.
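The difference between a character's code and the bytes that end up in a file can be seen in a few lines of Python. This is a minimal sketch assuming Python 3, where strings are Unicode by default; the IPA character is just an example I picked.

```python
import unicodedata

ch = "\u00fc"  # U+00FC, the 'ü' from the example above

# The code is an abstract number, independent of font or byte encoding.
code_point = "U+%04X" % ord(ch)
name = unicodedata.name(ch)

# The same character becomes different bytes under different encodings.
latin1_bytes = ch.encode("latin-1")  # one byte
utf8_bytes = ch.encode("utf-8")      # two bytes

# An IPA character has a Unicode code but no slot in ISO Latin-1.
ipa = "\u0283"  # U+0283, LATIN SMALL LETTER ESH
try:
    ipa.encode("latin-1")
    ipa_fits_latin1 = True
except UnicodeEncodeError:
    ipa_fits_latin1 = False

print(code_point, name)
print(latin1_bytes, utf8_bytes)
print("esh fits in Latin-1?", ipa_fits_latin1)
```

This is why a corpus file has to be read with the same encoding it was saved in: the bytes alone do not say which characters they stand for.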
Part-of-speech tagging
We built a primitive part-of-speech tagger that worked under the conditions that
All words of the language were known
Each word fell into exactly one of a set of non-overlapping part-of-speech categories
There were no homonyms crossing parts of speech.
In principle, for some languages these assumptions might be reasonable. [Like Esperanto?] For most languages, they are not. So part-of-speech taggers have to be a little more sophisticated than the ones you built.
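As a reminder of what such a primitive tagger amounts to, here is a sketch of a pure lookup tagger. It is not the lab solution; the tiny lexicon, the tag names, and the function name are invented for illustration.

```python
# Hypothetical toy lexicon: every known word has exactly one tag,
# which is what the three assumptions above require.
LEXICON = {
    "the": "DET",
    "dog": "N",
    "cat": "N",
    "chased": "V",
    "small": "ADJ",
}

def primitive_tag(words):
    """Tag each word by dictionary lookup; unknown words get None."""
    return [(w, LEXICON.get(w.lower())) for w in words]

tagged = primitive_tag("The small dog chased the cat".split())
print(tagged)

# The first assumption fails as soon as a word is missing from the lexicon.
unknown = primitive_tag(["The", "dog", "barked"])
print(unknown)  # 'barked' comes back with None
```

A dictionary with one value per key also cannot represent a word like 'walk' that is both a noun and a verb, which is the third assumption failing.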
Some decisions:
How fine are the part-of-speech classes? Just four? Or all the ones in the BNC? Or more?
Linguistic rules or statistical generalizations [or both]?
If rule-based, phrase-structural [probably not] or dependency-based [probably]?
The basic structure of part-of-speech taggers:
First, check the lexicon, which associates each wordform with a part-of-speech tag [or set of tags]. This can get very big, so it helps to generate it automatically, on the basis of a part-of-speech tagged corpus. Also, it might include multi-word units, like 'out of' or 'Super Bowl'.
If you don't find the word, perform a morphological analysis - check for endings and see whether stripping them yields a word that is in the lexicon and can take this ending.
If you do find the word, and it has a single part of speech associated with it, you're done.
If you find the word and it has multiple possible parts of speech, then you need to disambiguate.
Then disambiguate. Disambiguation usually uses a probability matrix design. Basically, if you have a large, part-of-speech tagged corpus of a language, you can calculate the probability that any given part of speech will occur in the neighborhood of any other part of speech, for example that a noun will occur after a determiner, or before a verb. Any grammatically ambiguous word has a matrix of probabilities associated with the other words around it [and their parts of speech].
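The lookup steps above [everything before disambiguation] can be sketched as follows. This is one possible arrangement under stated assumptions, not a standard implementation: the mini-corpus, the suffix rules, and all names are hypothetical, and multi-word units are left out.

```python
from collections import defaultdict

# Hypothetical hand-tagged mini-corpus of (wordform, tag) pairs.
TAGGED_CORPUS = [
    ("the", "DET"), ("dog", "N"), ("can", "AUX"), ("walk", "V"),
    ("the", "DET"), ("walk", "N"), ("was", "V"), ("long", "ADJ"),
    ("open", "V"), ("the", "DET"), ("can", "N"),
]

def build_lexicon(tagged_corpus):
    """Generate the lexicon automatically: wordform -> set of observed tags."""
    lexicon = defaultdict(set)
    for word, tag in tagged_corpus:
        lexicon[word.lower()].add(tag)
    return dict(lexicon)

# Hypothetical suffix rules: (ending, tag the stem must have, tag of the result).
SUFFIX_RULES = [
    ("ing", "V", "V"),
    ("ed", "V", "V"),
    ("s", "N", "N"),
    ("s", "V", "V"),
]

def candidate_tags(word, lexicon):
    """Return the set of possible tags for a word (empty if nothing is found)."""
    word = word.lower()
    if word in lexicon:  # direct lexicon hit
        return set(lexicon[word])
    tags = set()
    for ending, stem_tag, result_tag in SUFFIX_RULES:  # morphological analysis
        if word.endswith(ending) and len(word) > len(ending):
            stem = word[: -len(ending)]
            # The stem must be in the lexicon and able to take this ending.
            if stem_tag in lexicon.get(stem, set()):
                tags.add(result_tag)
    return tags

lexicon = build_lexicon(TAGGED_CORPUS)
for w in ["the", "dogs", "walks", "can", "blorf"]:
    print(w, sorted(candidate_tags(w, lexicon)))
```

A word that comes back with one tag is done; one that comes back with several ['can', 'walks'] is handed on to disambiguation; one that comes back empty needs some other fallback.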
The upshot is that the more ambiguity, the more it helps to have a hand-tagged corpus to use as the basis for automation of components of tagging.
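Here is a rough sketch of how a hand-tagged corpus feeds disambiguation. It estimates the probability of one tag following another from a made-up four-sentence corpus, then picks the candidate tag that best fits its left and right neighbors. It assumes the neighbors' tags are already known, which is a simplification; real statistical taggers [e.g. HMM taggers] search over whole tag sequences. All data and names are hypothetical.

```python
from collections import Counter

# Hypothetical hand-tagged sentences: lists of (word, tag) pairs.
TAGGED_SENTENCES = [
    [("the", "DET"), ("dog", "N"), ("can", "AUX"), ("run", "V")],
    [("the", "DET"), ("can", "N"), ("fell", "V")],
    [("a", "DET"), ("cat", "N"), ("can", "AUX"), ("jump", "V")],
    [("the", "DET"), ("run", "N"), ("ended", "V")],
]

def transition_probs(sentences):
    """Estimate P(next_tag | tag) from adjacent tag pairs in the corpus."""
    pair_counts = Counter()
    first_counts = Counter()
    for sent in sentences:
        tags = [tag for _, tag in sent]
        for prev, nxt in zip(tags, tags[1:]):
            pair_counts[(prev, nxt)] += 1
            first_counts[prev] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

def disambiguate(candidates, prev_tag, next_tag, probs):
    """Pick the candidate tag that best fits both neighboring tags."""
    def score(tag):
        return probs.get((prev_tag, tag), 0.0) * probs.get((tag, next_tag), 0.0)
    # sorted() makes the choice deterministic if two candidates tie.
    return max(sorted(candidates), key=score)

probs = transition_probs(TAGGED_SENTENCES)

# 'can' in "the dog can run": previous tag N, next tag V.
print(disambiguate({"AUX", "N"}, "N", "V", probs))    # AUX
# 'can' in "the can fell": previous tag DET, next tag V.
print(disambiguate({"AUX", "N"}, "DET", "V", probs))  # N
```

With only four sentences most tag pairs never occur and get probability zero, which is the small-scale version of the point above: the more ambiguity there is, the more hand-tagged data you need.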