Ling 431/631: Corpus Linguistics
Ben Bergen
Meeting 2: Design considerations and corpus types
August 27, 2007
Sampling and representativeness
Some basic statistical terminology:
· Sample: a subset of the population to be studied (in our case, the corpus)
We must ensure the sample is representative of the population, or we might get a
faulty picture of it.
· Random sampling is ideal, and typical in the physical and social
sciences - representativeness is ensured by randomly selecting members of the
population to study. We can do this in principle if we define our target
population tightly enough and have access to all of it.
o For example, if we're interested in studying all
plays published in American English in the 1990s, there's a good chance that
with a lot of work we could find the entire population, and then randomly
select texts of a given size from it.
o However, we often can't sample randomly, as
recording and encoding language data takes lots of resources and targeted
effort.
· Instead, we often make sure the corpus is roughly
equivalent to the target population using demographic sampling. This is how surveys often work - given that you
know what percentage of the population is female, you can ensure that your
survey has roughly the same percentage of female respondents. For us, the
relevant demographics fall along a number of dimensions:
o author. We can't learn about English from looking just at Dickens, so instead
we take data from many writers/speakers
o sociolinguistic variables (age, gender, register, domain) can be controlled
for by selecting a demographically valid set of speakers/writers (see BNC,
where they gave people tape recorders, and also included a broad set of
domains, like interviews and presentations)
o variety (are you interested in just normative speech? just one dialect? multiple
ones?). One approach, advocated by Biber, defines strata - types of language
(fiction, newspapers, speeches, etc.) - and then samples within them. This is
somewhat controversial because it depends on the analyst's impression of what
the strata are, but it's probably beneficial overall, especially for purposes of
transparency.
o medium (spoken, written, read, other?) The BNC has roughly 10% spoken
material.
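As a rough illustration of demographic (proportional) sampling within strata, here is a minimal Python sketch. The sampling frame, the stratum labels, and the function name stratified_sample are all hypothetical; the 90/10 split simply echoes the BNC's rough written/spoken proportion mentioned above, and a real corpus would balance several dimensions at once.

```python
import random
from collections import defaultdict


def stratified_sample(frame, proportions, n, seed=0):
    """Randomly pick about n texts so that each stratum's share of the
    sample matches the target proportions.

    frame: list of (text_id, stratum) pairs - the texts we have access to.
    proportions: dict mapping stratum -> target share (shares sum to 1).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group the available texts by stratum.
    by_stratum = defaultdict(list)
    for text_id, stratum in frame:
        by_stratum[stratum].append(text_id)

    sample = {}
    for stratum, share in proportions.items():
        quota = round(n * share)
        pool = by_stratum[stratum]
        if quota > len(pool):
            raise ValueError(
                f"not enough {stratum!r} texts: need {quota}, have {len(pool)}"
            )
        # Random sampling *within* the stratum, without replacement.
        sample[stratum] = rng.sample(pool, quota)
    return sample


# Hypothetical frame: 900 written texts and 100 spoken ones.
frame = [(f"w{i}", "written") for i in range(900)]
frame += [(f"s{i}", "spoken") for i in range(100)]

sample = stratified_sample(frame, {"written": 0.9, "spoken": 0.1}, n=50)
print({stratum: len(ids) for stratum, ids in sample.items()})
```

The choice of strata and target shares is exactly the analyst's judgment call discussed above; the code only makes that choice explicit and repeatable.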
Size
How big should a corpus be, to be a representative sample? It depends on what
you want to do with it.
· For lexicography or language teaching/learning,
you just need as many instances of whatever item it is you're looking for as
suits your purpose. E.g., maybe you need 50 instances of a noun to know what it
does. This implies a very large corpus if you're interested in infrequent
words. (In the BNC, which has 100 million words, recluse occurs 100 times - i.e.
in a representative subset of a million words, it would occur about once, so
collecting 50 instances would take roughly 50 million words.)
· If you want to use the corpus for scientific
purposes - e.g. determining the absolute or relative frequency of a word,
morpheme, phoneme, collocation, sentence pattern, etc., in the language, or
norming stimuli for experiments - then you need to make sure your sample is
statistically representative for whatever it is you're investigating. You
figure out your required sample size using two factors - the standard deviation
of some feature and the tolerable error for that feature, as described on pp.
80-81 of the book.
· The solution most large corpus builders adopt,
since they can't predict exactly what uses the corpus will be put to, combines
three considerations:
o Broad sampling - sample from as many different
varieties, media, etc. as possible
o Balancing - not skewing the contents in one way or
another
o Big - more words isn't always better, but it tends
to help
· The dispersion of linguistic features throughout the corpus is a
critical measure. A given word, construction, etc., that is used 20 times in a
corpus tells you something quite different when those 20 instances are all in
the same text, versus when they are distributed across texts. We'll look at how
to calculate dispersion later in the semester.
· A final question is whether sampling should be
based on production or perception frequencies. E.g., should you calculate the
percentage of your corpus to be made up of sitcoms on the basis of what
percentage of the total language produced they constitute, or on the basis of
how much time on average people spend watching them? Again, it depends on what
you're using the corpus for.
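The size arithmetic in this section can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than the book's own procedure: required_sample_size uses the common textbook formula n = (z * s / E)^2 with z = 1.96 (roughly 95% confidence), which may differ in detail from the one on the pages cited above, and text_range is only the crudest possible dispersion measure (the share of texts containing the item at all), not necessarily the one we'll use later in the semester. The function names are made up for this example.

```python
import math


def per_million(count, corpus_size):
    """Occurrences per million words."""
    return count * 1_000_000 / corpus_size


def corpus_size_needed(instances_wanted, count, corpus_size):
    """Words needed to expect `instances_wanted` hits, assuming the item
    keeps the rate observed in a reference corpus (count / corpus_size)."""
    return math.ceil(instances_wanted * corpus_size / count)


def required_sample_size(std_dev, tolerable_error, z=1.96):
    """Assumed textbook formula: n = (z * s / E) ** 2, rounded up.
    std_dev and tolerable_error must be in the same units."""
    return math.ceil((z * std_dev / tolerable_error) ** 2)


def text_range(counts_per_text):
    """Crude dispersion: proportion of texts with at least one occurrence."""
    return sum(1 for c in counts_per_text if c > 0) / len(counts_per_text)


# The recluse example: 100 occurrences in 100 million words.
print(per_million(100, 100_000_000))             # 1.0
print(corpus_size_needed(50, 100, 100_000_000))  # 50000000

# Hypothetical feature: standard deviation 10, tolerable error 2.
print(required_sample_size(std_dev=10, tolerable_error=2))

# 20 instances in one text vs. spread evenly over four texts.
print(text_range([20, 0, 0, 0]))  # 0.25
print(text_range([5, 5, 5, 5]))   # 1.0
```

Note that the two 20-instance cases have the same overall frequency but very different ranges, which is the point about dispersion made above.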
Beyond the text
Corpora are most useful when they include annotative metadata - data about the
data.
· The encoding format. For languages that cannot be written with
the standard ISO-646 set (which is enough for English), you need to declare what
set you are using, so that the end user can interpret the data. Unicode seems to
be the best overall solution - it's a unified and inclusive character set,
designed in principle to cover all writing systems. It also has an IPA subset.
We'll look more at Unicode later in the term.
· For each text, anything that might be useful about
it, including the dimensions listed above. This extra-textual information is
often included in a header, much like the header in an HTML document:
<text>
<header author="Bob" sex="male" title="Things I put in my nose"/>
<body>Hi, my name is Bob. Here are some things I put in my nose: peas, pencils, ...
</body>
</text>
· For each word, the part of speech. A
part-of-speech tagged corpus might look like this:
One_CRD of_PRF the_AT0 great_AJ0 cliches_NN2
of_PRF the_AT0 last_ORD few_DT0 months_NN2 was_VBD that_DT0 Sept._NP0 11_CRD changed_VVD
everything_PNI ._. [...]
There are lots of other possible formats, but essentially there are three
choices to make:
o Which annotation format to use
§ CLAWS, Penn, BNC
§ For translatability and easy access, use one of
these standards and modify/expand as necessary (and document modifications)
o What annotation scheme to use
§ /(), _, -, <>, &, etc.
§ You want to be able to easily recover the original
text (i.e. remove the tags)
o How to tag the corpus: by hand (for the masochist)
or automatically (later this semester)
· Anything else, including lemmas, parses,
semantics, phonetics, prosody
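To make the header and tagging ideas concrete, here is a minimal Python sketch that reads the metadata out of a header and recovers plain text from word_TAG annotation. It assumes a well-formed XML version of the toy text above (with a self-closing header) and the underscore delimiter from the tagged example; the function names are invented for this illustration, and real corpora such as the BNC use richer markup than this.

```python
import xml.etree.ElementTree as ET

# Toy text in well-formed XML (an assumption: the header is self-closing).
SAMPLE = """<text>
<header author="Bob" sex="male" title="Things I put in my nose"/>
<body>Hi, my name is Bob.</body>
</text>"""


def read_text(xml_string):
    """Return (metadata dict, body text) for one <text> element."""
    root = ET.fromstring(xml_string)
    header = root.find("header")
    body = root.find("body")
    return dict(header.attrib), body.text.strip()


def split_tagged(tagged, sep="_"):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs.
    Tokens without the separator get a tag of None."""
    pairs = []
    for token in tagged.split():
        # Split on the LAST separator, so '._.' becomes ('.', '.').
        word, found, tag = token.rpartition(sep)
        pairs.append((word, tag) if found else (token, None))
    return pairs


def strip_tags(tagged, sep="_"):
    """Recover the words, joined by single spaces. Original spacing around
    punctuation is not restored, since the tagged format doesn't record it."""
    return " ".join(word for word, _ in split_tagged(tagged, sep))


meta, body = read_text(SAMPLE)
print(meta["author"], "-", meta["title"])

tagged = "One_CRD of_PRF the_AT0 great_AJ0 cliches_NN2"
print(split_tagged(tagged)[:2])  # [('One', 'CRD'), ('of', 'PRF')]
print(strip_tags(tagged))        # One of the great cliches
```

Splitting on the last delimiter is one reason the choice of delimiter matters: if the delimiter can also occur inside words, removing the tags cleanly gets harder.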
Whatever you do, document it. Make sure it's clear who collected what data, how it was sampled, how many words there are, what the tags mean, how they were assigned, and so on. Anyone who uses the corpus (that includes you) will need to know all this, and you will inevitably forget.