Ling 431/631: Corpus Linguistics

Ben Bergen

 

Meeting 2: Design considerations and corpus types

August 27, 2007

 

Sampling and representativeness

 

Some basic statistical terminology:

·      Sample: a subset of the population to be studied (in our case, the corpus)

 

We must ensure the sample is representative of the population, or we might get a faulty picture of it.

·      Random sampling is ideal, and typical in the physical and social sciences - representativeness is ensured by randomly selecting members of the population to study. We can do this in principle if we define our target population tightly enough and have access to all of it.

o      For example, if we're interested in studying all plays published in American English in the 1990s, there's a good chance that with a lot of work we could find the entire population, and then randomly select texts of a given size from it.

o      However, we often can't sample randomly, because recording and encoding language data takes a lot of resources and targeted effort.

·      Instead, we often make sure the corpus is roughly equivalent to the target population using demographic sampling. This is how surveys often work - given that you know what percentage of the population is female, you can ensure that your survey has roughly the same percentage of female respondents. For us, the relevant demographics fall along a number of dimensions:

o      author. We can't learn about English from looking just at Dickens, so instead we take data from many writers/speakers

o      sociolinguistic variables (age, gender, register, domain) can be controlled for by selecting a demographically valid set of speakers/writers (see BNC, where they gave people tape recorders, and also included a broad set of domains, like interviews and presentations)

o      variety (are you interested in just normative speech? just one dialect? multiple ones?). One approach, advocated by Biber, divides the language into strata - types of language (like fiction, newspapers, speeches, etc.) - and then samples within each stratum. This is somewhat controversial because it relies on the analyst's impression of what the strata are, but it's probably beneficial overall, especially for purposes of transparency.

o      medium (spoken, written, read, other?) The BNC has roughly 10% spoken material.
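
To make the contrast between random sampling and demographic (stratified) sampling concrete, here is a minimal Python sketch. It assumes we can list the whole population. The text names, the three strata and their sizes (600/300/100) are invented for illustration, and proportional allocation is just one reasonable way to fill the strata, not necessarily what any real corpus did.

```python
import random
from collections import defaultdict

# Hypothetical population of (text_id, stratum) pairs.
# The strata and their sizes are invented for illustration only.
population = (
    [(f"fiction_{i}", "fiction") for i in range(600)]
    + [(f"news_{i}", "newspapers") for i in range(300)]
    + [(f"speech_{i}", "speeches") for i in range(100)]
)


def simple_random_sample(texts, n, seed=0):
    """Pick n texts uniformly at random, ignoring strata."""
    rng = random.Random(seed)
    return rng.sample(texts, n)


def stratified_sample(texts, n, seed=0):
    """Pick about n texts so that each stratum keeps its population share."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for text_id, stratum in texts:
        by_stratum[stratum].append((text_id, stratum))
    sample = []
    for members in by_stratum.values():
        # Proportional allocation: the stratum's share of the population times n.
        quota = round(n * len(members) / len(texts))
        sample.extend(rng.sample(members, quota))
    return sample


def stratum_counts(sample):
    """Count how many sampled texts fall in each stratum."""
    counts = defaultdict(int)
    for _, stratum in sample:
        counts[stratum] += 1
    return dict(counts)


print(stratum_counts(simple_random_sample(population, 50)))
print(stratum_counts(stratified_sample(population, 50)))
```

With these made-up numbers the stratified sample always contains 30 fiction texts, 15 newspaper texts and 5 speeches, mirroring the population. The simple random sample only approximates those proportions, and a small stratum can end up badly under- or over-represented by chance.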

 

Size

 

How big should a corpus be, to be a representative sample? It depends on what you want to do with it.

·      For lexicography or language teaching/learning, you just need as many instances of whatever item it is you're looking for as suits your purpose. E.g., maybe you need 50 instances of a noun to know what it does. This implies a very large corpus if you're interested in infrequent words. (In the BNC, which has 100 million words, "recluse" occurs 100 times - i.e. in a representative subset of a million words, it would occur about once.)

·      If you want to use the corpus for scientific purposes - e.g. determining the absolute or relative frequency of a word, morpheme, phoneme, collocation, sentence pattern, etc., in the language, or norming stimuli for experiments - then you need to make sure your sample is statistically representative for whatever it is you're investigating. You figure out your required sample size using two factors - the standard deviation of some feature and the tolerable error for that feature, as described on pp. 80-81 of the book.

·      The solution most large corpus builders adopt, since they can't predict exactly what uses the corpus will be put to, combines three considerations:

o      Broad sampling - sample from as many different varieties, media, etc. as possible

o      Balancing - not skewing the contents in one way or another

o      Size - more words isn't always better, but it tends to help

·      The dispersion of linguistic features throughout the corpus is a critical measure. A given word, construction, etc., that is used 20 times in a corpus tells you something quite different when those 20 instances are all in the same text, versus when they are distributed across texts. We'll look at how to calculate dispersion later in the semester.

·      A final question is whether sampling should be based on production or perception frequencies. E.g., should you calculate the percentage of your corpus to be made up of sitcoms on the basis of what percentage of the total language produced they constitute, or on the basis of how much time on average people spend watching them? Again, it depends on what you're using the corpus for.
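
The arithmetic behind these size considerations can be sketched in a few lines of Python. Everything here is a rough illustration under stated assumptions. The sample-size function uses the standard introductory-statistics formula n = (z * s / e)^2, which may not be exactly the form given in the book. The z value of 1.96 (95% confidence) is an assumption. The "range" function is only the simplest possible dispersion measure, not necessarily the one we'll use later in the semester. All function names are made up for this example.

```python
import math


def per_million(count, corpus_size):
    """Normalized frequency: occurrences per million words."""
    return count * 1_000_000 / corpus_size


def corpus_size_for_hits(hits_needed, freq_per_million):
    """Corpus size (in words) needed to expect hits_needed instances."""
    return math.ceil(hits_needed * 1_000_000 / freq_per_million)


def required_sample_size(std_dev, tolerable_error, z=1.96):
    """Texts needed to estimate a feature's mean within +/- tolerable_error.

    Uses n = (z * s / e) ** 2; z = 1.96 (95% confidence) is an assumption
    of this sketch, not a value taken from the course text.
    """
    return math.ceil((z * std_dev / tolerable_error) ** 2)


def text_range(counts_per_text):
    """Simplest dispersion measure: share of texts that contain the item."""
    return sum(1 for c in counts_per_text if c > 0) / len(counts_per_text)


# "recluse" in the BNC: 100 hits in 100 million words.
print(per_million(100, 100_000_000))        # 1.0
# Words needed to expect 50 instances at that rate.
print(corpus_size_for_hits(50, 1.0))        # 50000000
# Hypothetical feature: standard deviation 10, tolerable error 2.
print(required_sample_size(10, 2))
# 20 hits in one text vs. 20 hits spread evenly over four texts.
print(text_range([20, 0, 0, 0]), text_range([5, 5, 5, 5]))
```

Note how quickly the numbers grow: at one occurrence per million words, 50 instances of "recluse" already call for about half the BNC. The two dispersion values differ even though both lists add up to 20 hits, which is exactly why raw frequency alone can mislead.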

 

Beyond the text

 

Corpora are most useful when they include annotative metadata - data about the data.

·      The encoding format. For languages that cannot be written with the standard ISO 646 character set (essentially ASCII, which is enough for basic English), you need to declare which character set you are using, so that the end user can interpret the data. Unicode seems to be the best overall solution - it's a unified and inclusive character set, designed in principle to cover all writing systems. It also has an IPA subset. We'll look more at Unicode later in the term.

·      For each text, anything that might be useful about it, including the dimensions listed above. This extra-textual information is often included in a header, much like the header in an HTML document:

<text>

<header author="Bob" sex="male" title="Things I put in my nose" />

<body>Hi, my name is Bob. Here are some things I put in my nose: peas, pencils, ...</body>

</text>

·      For each word, the part of speech. A part-of-speech tagged corpus might look like this:

 

One_CRD of_PRF the_AT0 great_AJ0 cliches_NN2 of_PRF the_AT0 last_ORD few_DT0 months_NN2 was_VBD that_DT0 Sept._NP0 11_CRD changed_VVD everything_PNI ._.  [...]

 

            There are lots of other possible formats, but essentially there are three choices to make:

o      Which annotation scheme (tagset) to use

§       CLAWS, Penn, BNC

§       For translatability and easy access, use one of these standards and modify/expand as necessary (and document modifications)

o      Which annotation format to use, i.e. how tags are attached to words

§       /(), _, -, <>, &, etc.

§       You want to be able to easily recover the original text (i.e. remove the tags)

o      How to tag the corpus: by hand (for the masochist) or automatically (later this semester)

·      Anything else, including lemmas, parses, semantics, phonetics, prosody
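
As a rough illustration of working with the two kinds of annotation shown above, the Python sketch below reads the metadata out of a header and strips the tags from a part-of-speech tagged line. It assumes a well-formed XML version of the header example and the word_TAG format with "_" as separator. The helper name split_tagged is made up for this example, and real corpora will need more careful handling.

```python
import xml.etree.ElementTree as ET

# A well-formed variant of the header example: the header element is
# self-closed here so that a standard XML parser accepts it.
doc = """<text>
<header author="Bob" sex="male" title="Things I put in my nose"/>
<body>Hi, my name is Bob.</body>
</text>"""

root = ET.fromstring(doc)
metadata = dict(root.find("header").attrib)  # extra-textual information
body = root.find("body").text                # the text itself

tagged = "One_CRD of_PRF the_AT0 great_AJ0 cliches_NN2"


def split_tagged(line, sep="_"):
    """Split 'word_TAG' tokens into (word, tag) pairs.

    rpartition splits on the LAST separator, so a word that itself
    contains the separator keeps it.
    """
    pairs = []
    for token in line.split():
        word, found, tag = token.rpartition(sep)
        if not found:  # untagged token: keep it as a word with no tag
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


pairs = split_tagged(tagged)
# Recover the original text by dropping the tags.
plain_text = " ".join(word for word, _ in pairs)

print(metadata["author"], "|", body)
print(pairs)
print(plain_text)
```

Joining words with single spaces is a simplification: punctuation tokens such as ._. come back as a separate "." with a space in front of them, so recovering the exact original spacing takes extra work or an annotation format that records it.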

 

Whatever you do, document it. Make sure it's clear who collected what data, how it was sampled, how many words there are, what the tags mean, how they were assigned, and so on. Anyone who uses the corpus (that includes you) will need to know all this, and you will inevitably forget.