Ling 431/631: Corpus Linguistics

Ben Bergen


Meeting 13: Encoding schemes and part-of-speech tagging

November 26, 2007


General




Encoding


There are many different character encoding standards. If you're using something akin to the English writing system [Roman characters], one of the standard encodings, like ISO Latin-1, will work [that's usually the default]. But for other systems, like IPA or other writing systems, Unicode is the way to go.
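To make the difference concrete, here is a minimal Python sketch of the point above. The sample word and its IPA transcription are chosen only for illustration, and `fits_latin1` is a hypothetical helper name, not part of any tool used in the course.

```python
# Compare how ISO Latin-1 and UTF-8 (a Unicode encoding) handle two strings.
english = "corpus"
ipa = "ˈkɔːpəs"  # IPA transcription, used here only as an example

# Plain unaccented Roman text produces identical bytes in both encodings.
same_bytes = english.encode("latin-1") == english.encode("utf-8")

def fits_latin1(text):
    """Return True if every character in text can be encoded in ISO Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

english_ok = fits_latin1(english)  # Roman characters fit
ipa_ok = fits_latin1(ipa)          # IPA symbols such as the stress mark do not

# UTF-8 can store the IPA string and give it back unchanged.
round_trip = ipa.encode("utf-8").decode("utf-8")

print(same_bytes, english_ok, ipa_ok, round_trip == ipa)
```

The Latin-1 attempt fails on the IPA string because characters like the stress mark and the schwa lie outside its 256-character range.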


Unicode [its character set is kept identical to that of ISO/IEC 10646]

You use Unicode just like you would any other encoding standard. For instance, in TextStat, if a file you add to your corpus is in Unicode, you can search for characters in it by typing (or pasting) them into the pattern definition. In Word, certain fonts, like Arial Unicode MS, support Unicode, and you can use them to create Unicode texts for your corpus if you like.
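The same kind of search can be sketched outside TextStat. The following Python snippet is only an illustration of pasting a Unicode character into a search pattern; it is not how TextStat itself works. The three corpus lines are invented, and `find_lines` is a hypothetical helper.

```python
import re

# A tiny in-memory "corpus". In practice the lines would come from a file
# opened with an explicit encoding, e.g. open(path, encoding="utf-8"),
# where path is whatever file you added to your corpus.
corpus_lines = [
    "thing /θɪŋ/",
    "this /ðɪs/",
    "sing /sɪŋ/",
]

def find_lines(pattern_text, lines):
    """Return the lines that contain pattern_text, matched as literal characters."""
    pattern = re.compile(re.escape(pattern_text))
    return [line for line in lines if pattern.search(line)]

# Search for the velar nasal by pasting the character itself into the pattern.
hits = find_lines("ŋ", corpus_lines)
print(hits)
```

Because Python strings are Unicode, the IPA character is matched like any other letter; no special handling is needed once the file has been decoded correctly.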





Part-of-speech tagging

We built a primitive part-of-speech tagger that worked under the conditions that

In principle, for some languages these assumptions might be reasonable. [Like Esperanto?] For most languages, they are not. So part-of-speech taggers have to be a little more sophisticated than the ones you built.

Some decisions:

If rule-based, phrase-structural [probably not] or dependency-based [probably]?

The basic structure of part-of-speech taggers:

The upshot is that the more ambiguity there is, the more it helps to have a hand-tagged corpus to use as the basis for automating components of tagging.
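One simple way a hand-tagged corpus can feed automation is a most-frequent-tag baseline: count how often each word carries each tag, then assign every word its commonest tag. The sketch below is an assumption-laden illustration, not the tagger built in class. The toy corpus, the tag labels, the function names `train` and `tag`, and the choice of NOUN as the fallback for unseen words are all invented for the example.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus of (word, tag) pairs, invented for illustration.
tagged_corpus = [
    ("the", "DET"), ("dog", "NOUN"), ("can", "AUX"), ("run", "VERB"),
    ("the", "DET"), ("can", "NOUN"), ("is", "VERB"), ("empty", "ADJ"),
    ("she", "PRON"), ("can", "AUX"), ("sing", "VERB"),
]

def train(pairs):
    """Map each word to the tag it carries most often in the hand-tagged data."""
    counts = defaultdict(Counter)
    for word, tag_label in pairs:
        counts[word.lower()][tag_label] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default="NOUN"):
    """Tag each word with its most frequent tag; unseen words get the default."""
    return [(w, model.get(w.lower(), default)) for w in words]

model = train(tagged_corpus)
result = tag(["the", "can", "can", "leak"], model)
print(result)
```

In this toy data "can" is tagged AUX twice and NOUN once, so the baseline labels every "can" as AUX, including the one right after "the" that should be a noun. Resolving that kind of ambiguity takes context, which this baseline ignores, and the hand-tagged corpus is what supplies the counts in either case.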