Ling 431/631: Corpus Linguistics
Ben Bergen
Meeting 5: Parts of speech
September 17, 2007
This week, we're going to focus on the use of part-of-speech-tagged corpora. In particular, we're going to use a part of the British National Corpus (BNC).
Today, we'll focus on two issues: (1) how to use part-of-speech tags in regular expression searches, and (2) what uses you can put them to.
Using Regular Expressions on part of speech tags
Tags in the BNC follow the format described on pages 2-3 below. Let's take a look at these together.
[…time passes…]
Now that you're familiar with the BNC part-of-speech tags, how can you define searches to find, for example, the following things:
What can you do with part of speech tags?
Here's just a sample of things that are easier with POS tags.
If we have time, we can brainstorm some other tasks.
BNC Tags
Tags immediately precede the word or character they describe (with no intervening space). They are enclosed in angle brackets: < >. Inside the brackets is a letter stating whether the tag describes a segment (roughly a sentence), a word, or a character (s, w, or c, respectively), followed by a space, and then the content of the tag.
Here's a sample sentence:
<s n=11><w NN1>Difficulty <w VBZ>is <w VBG>being <w VVN>expressed <w PRP>with
<w AT0>the <w NN1>method <w TO0>to <w VBI>be <w VVN>used <w TO0>to
<w VVI>launch <w AT0>the <w NN1>scheme<c PUN>.
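To see how this format can be taken apart with a regular expression, here is a minimal Python sketch. It assumes every word or character tag has exactly the form shown in the sample sentence (`<w XXX>` or `<c XXX>` with a three-character content). The names `TOKEN` and `tagged_tokens` are made up for illustration.

```python
import re

sentence = (
    "<s n=11><w NN1>Difficulty <w VBZ>is <w VBG>being <w VVN>expressed "
    "<w PRP>with <w AT0>the <w NN1>method <w TO0>to <w VBI>be "
    "<w VVN>used <w TO0>to <w VVI>launch <w AT0>the <w NN1>scheme<c PUN>."
)

# A word (w) or character (c) tag, then the text up to the next "<".
# Assumption: tag content is always three capital letters or digits.
TOKEN = re.compile(r"<([wc]) ([A-Z0-9]{3})>([^<]*)")

def tagged_tokens(text):
    """Return (kind, tag, text) triples; segment tags like <s n=11> are skipped."""
    return [(kind, tag, word.strip()) for kind, tag, word in TOKEN.findall(text)]

tokens = tagged_tokens(sentence)
for kind, tag, word in tokens:
    print(kind, tag, word)
```

The first triple printed is `w NN1 Difficulty` and the last is `c PUN .`.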
The content of each tag consists of three characters. Generally, the first two characters indicate the general part of speech, and the third character indicates a subcategory. When the tag marks the most general, unmarked category of a part of speech, the third character is usually 0. (For example, AJ0 is the tag for the most general class of adjectives.)
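Because the first characters of a tag carry the general part of speech, a search can fix those characters and leave the rest open. The Python sketch below is one possible way to do this, assuming the three-character tag format described above. The helper name `words_with_pos` is hypothetical.

```python
import re

sentence = (
    "<s n=11><w NN1>Difficulty <w VBZ>is <w VBG>being <w VVN>expressed "
    "<w PRP>with <w AT0>the <w NN1>method <w TO0>to <w VBI>be "
    "<w VVN>used <w TO0>to <w VVI>launch <w AT0>the <w NN1>scheme<c PUN>."
)

def words_with_pos(text, prefix):
    """Return the words whose tag starts with a one- to three-character prefix."""
    # Pad the prefix with wildcards so the tag is always three characters long.
    pad = "[A-Z0-9]" * (3 - len(prefix))
    pattern = re.compile(r"<w " + re.escape(prefix) + pad + r">([^<]+)")
    return [word.strip() for word in pattern.findall(text)]

print(words_with_pos(sentence, "NN"))   # common nouns of any number
print(words_with_pos(sentence, "VV"))   # lexical verbs in any form
print(words_with_pos(sentence, "VVN"))  # past participles of lexical verbs only
```

On the sample sentence, the prefix `NN` returns Difficulty, method, and scheme, while `VVN` returns only expressed and used.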
| AJ0 | Adjective (general or positive) (e.g. good, old, beautiful) |
| AJC | Comparative adjective (e.g. better, older) |
| AJS | Superlative adjective (e.g. best, oldest) |
| AT0 | Article (e.g. the, a, an, no) [N.B. no is included among articles, which are defined here as determiner words which typically begin a noun phrase, but which cannot occur as the head of a noun phrase.] |
| AV0 | General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest) |
| AVP | Adverb particle (e.g. up, off, out) [N.B. AVP is used for such "prepositional adverbs", whether or not they are used idiomatically in a phrasal verb: e.g. in Come out here the AVP tag is used for out.] |
| AVQ | Wh-adverb (e.g. when, where, how, why, wherever) [For either interrogative or relative use.] |
| CJC | Coordinating conjunction (e.g. and, or, but) |
| CJS | Subordinating conjunction (e.g. although, when) |
| CJT | The subordinating conjunction that [Used when that introduces a nominal clause and also when it introduces a relative clause, as in the day that follows Christmas.] |
| CRD | Cardinal number (e.g. one, 3, fifty-five, 3609) |
| DPS | Possessive determiner (e.g. your, their, his) |
| DT0 | General determiner: i.e. a determiner which is not a DTQ. [Here a determiner is defined as a word which typically occurs either as the first word in a noun phrase, or as the head of a noun phrase. E.g. This is tagged DT0 both in This is my house and in This house is mine.] |
| DTQ | Wh-determiner (e.g. which, what, whose, whichever) [The category of determiner here is defined as for DT0 above. The tag applies whether they occur in interrogative or in relative use.] |
| EX0 | Existential there, i.e. there occurring in the there is ... or there are ... construction |
| ITJ | Interjection or other isolate (e.g. oh, yes, mhm, wow) |
| NN0 | Common noun, neutral for number (e.g. aircraft, data, committee) [N.B. Singular collective nouns such as committee and team are tagged NN0, on the grounds that they are capable of taking singular or plural agreement with the following verb: e.g. The committee disagrees/disagree.] |
| NN1 | Singular common noun (e.g. pencil, goose, time, revelation) |
| NN2 | Plural common noun (e.g. pencils, geese, times, revelations) |
| NP0 | Proper noun (e.g. London, Michael, Mars, IBM) [No distinction is made between singular and plural.] |
| ORD | Ordinal numeral (e.g. first, sixth, 77th, last) [In a nominal or adverbial role. Next and last, as "general ordinals", are also assigned to this category.] |
| PNI | Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody) [This tag applies to words that always function as [heads of] noun phrases. Words like some and these, which can also occur before a noun head in an article-like function, are tagged as determiners (see DT0 and AT0 above).] |
| PNP | Personal pronoun (e.g. I, you, them, ours) [Possessive pronouns like ours and theirs are tagged as PNP.] |
| PNQ | Wh-pronoun (e.g. who, whoever, whom) [Whether they occur in interrogative or in relative use.] |
| PNX | Reflexive pronoun (e.g. myself, yourself, itself, ourselves) |
| POS | The possessive or genitive marker 's or ' (e.g. for Peter's or somebody else's, the sequence of tags is: NP0 POS CJC PNI AV0 POS) |
| PRF | The preposition of |
| PRP | Preposition (except for of) (e.g. about, at, in, on, on behalf of, with) |
| PUL | Punctuation: left bracket - i.e. ( or [ |
| PUN | Punctuation: general separating mark - i.e. . ! , : ; - or ? |
| PUQ | Punctuation: quotation mark - i.e. ' or " |
| PUR | Punctuation: right bracket - i.e. ) or ] |
| TO0 | Infinitive marker to |
| UNC | Unclassified items which are not appropriately classified as items of the English lexicon. [Including non-English words, special typographical symbols, formulae, and (in spoken language) hesitation fillers such as er and erm.] |
| VBB | The present tense forms of the verb BE, except for is, 's: i.e. am, are, 'm, 're and be [subjunctive or imperative] |
| VBD | The past tense forms of the verb BE: was and were |
| VBG | The -ing form of the verb BE: being |
| VBI | The infinitive form of the verb BE: be |
| VBN | The past participle form of the verb BE: been |
| VBZ | The -s form of the verb BE: is, 's |
| VDB | The finite base form of the verb DO: do |
| VDD | The past tense form of the verb DO: did |
| VDG | The -ing form of the verb DO: doing |
| VDI | The infinitive form of the verb DO: do |
| VDN | The past participle form of the verb DO: done |
| VDZ | The -s form of the verb DO: does, 's |
| VHB | The finite base form of the verb HAVE: have, 've |
| VHD | The past tense form of the verb HAVE: had, 'd |
| VHG | The -ing form of the verb HAVE: having |
| VHI | The infinitive form of the verb HAVE: have |
| VHN | The past participle form of the verb HAVE: had |
| VHZ | The -s form of the verb HAVE: has, 's |
| VM0 | Modal auxiliary verb (e.g. will, would, can, could, 'll, 'd) |
| VVB | The finite base form of lexical verbs (e.g. forget, send, live, return) [Including the imperative and present subjunctive] |
| VVD | The past tense form of lexical verbs (e.g. forgot, sent, lived, returned) |
| VVG | The -ing form of lexical verbs (e.g. forgetting, sending, living, returning) |
| VVI | The infinitive form of lexical verbs (e.g. forget, send, live, return) |
| VVN | The past participle form of lexical verbs (e.g. forgotten, sent, lived, returned) |
| VVZ | The -s form of lexical verbs (e.g. forgets, sends, lives, returns) |
| XX0 | The negative particle not or n't |
| ZZ0 | Alphabetical symbols (e.g. A, a, B, b, c, d) |
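Tags from the table can also be combined into searches over sequences of words. As an illustrative sketch only, the Python below looks for any form of BE (VBB, VBD, VBG, VBI, VBN, or VBZ) immediately followed by a past participle of a lexical verb (VVN), which is a rough way of finding candidate passives. The choice of task and the name `BE_PLUS_PARTICIPLE` are assumptions made for this example. The pattern will miss passives with intervening words, such as adverbs.

```python
import re

sentence = (
    "<s n=11><w NN1>Difficulty <w VBZ>is <w VBG>being <w VVN>expressed "
    "<w PRP>with <w AT0>the <w NN1>method <w TO0>to <w VBI>be "
    "<w VVN>used <w TO0>to <w VVI>launch <w AT0>the <w NN1>scheme<c PUN>."
)

# Any BE tag, its word, then a VVN tag and its word, with nothing in between.
BE_PLUS_PARTICIPLE = re.compile(r"<w VB[BDGINZ]>([^<]+)<w VVN>([^<]+)")

pairs = [
    (be.strip(), participle.strip())
    for be, participle in BE_PLUS_PARTICIPLE.findall(sentence)
]
print(pairs)  # [('being', 'expressed'), ('be', 'used')]
```

Note that is being is not matched on its own, because being is tagged VBG rather than VVN.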