Ling 431/631: Corpus Linguistics
Ben Bergen
Meeting 4: Quantitative Methods
September 10, 2007
Preliminaries
Questions about Lab 2?
One terminological/conceptual issue before we move on: the distinction between tokens and types.
· Types are categories and tokens are instances of the categories.
· For instance, this sentence has fourteen word tokens, which represent thirteen different word types.
· Types can be graphical word forms (like repudiates), lexemes (like repudiate, which would include repudiates, repudiating, etc.), parts of speech, etc.
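The token/type count for the example sentence can be checked with a few lines of Python. This is only a sketch: it assumes that types are lowercased graphical word forms and uses a deliberately naive tokenizer (runs of letters), which is not how the lab software necessarily counts.

```python
import re

sentence = ("For instance, this sentence has fourteen word tokens, "
            "which represent thirteen different word types.")

# Naive tokenizer (an assumption): lowercase, then keep runs of letters.
tokens = re.findall(r"[a-z]+", sentence.lower())

# Each distinct word form is one type.
types = set(tokens)

print(len(tokens), len(types))  # 14 13
```

The only repeated form is word, which is why there is one fewer type than tokens. Counting lexemes or parts of speech instead would need a lemmatizer or tagger.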
Today: what sorts of statistical tools are used to quantitatively analyze a corpus?
Descriptive statistics
Descriptive statistics are used to summarize collections of data in clear and understandable ways.
· Common descriptive statistics: mode, median, variance, range, standard deviation.
· They can't tell you the likelihood that the tendencies in your data are due to chance.
o If you took this sentence as a corpus, you would find that the word corpus occurs about one time out of every seventeen words, and is more frequent than a, I, or are.
o You probably share the intuition that this trend does not generalize to English in general, but a frequency count alone can't tell us this.
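As a rough illustration of the descriptive statistics listed above, here is a sketch using Python's standard statistics module. The counts are invented for the example (imagine the frequency of one word in five texts) and do not come from the lab corpus.

```python
import statistics

# Hypothetical counts of one word in five texts (not real data).
counts = [3, 5, 5, 8, 12]

summary = {
    "mode": statistics.mode(counts),
    "median": statistics.median(counts),
    "range": max(counts) - min(counts),
    "variance": statistics.variance(counts),  # sample variance (divides by n - 1)
    "stdev": statistics.stdev(counts),        # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

All of these numbers describe the five texts and nothing more. None of them says whether a different sample of texts would look the same.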
Inferential statistics
Inferential statistics: tests you apply to quantitative data to determine the likelihood that the results you observe are due to chance, and thus whether or not they can be generalized to the larger population from which your sample was drawn.
One of the most common questions one asks when looking at a corpus is whether frequency differences are statistically significant - that is, whether the likelihood that they arose due to chance is below a predetermined threshold of acceptability.
E.g., you are looking at two texts and want to know whether a particular word is significantly more frequent in one than in the other. For example, compare the frequency of variants of repudiate in the corpus from Lab 2 (18/16,858) with that in the BNC (450/100,106,089). The proportion seems much higher in the Lab 2 corpus, and you can set up the data in a contingency table, like so (notice that it's the raw frequencies, and not ratios, that are placed in these tables):
|      | repudiate | other words |
| Lab3 | 18        | 16,840      |
| BNC  | 450       | 100,105,639 |
In order to test whether this difference could be due to chance, we use either the Chi-Square test or Fisher's Exact Test. Chi-Square is used more widely, but it's actually less accurate when you have small numbers in each cell. So Fisher's Exact is preferred (as long as you have only four cells). Using an online Fisher's Exact calculator (http://www.matforsk.no/ola/fisher.htm), we get:
TABLE = [ 18 , 16840 , 450 , 100105639 ]
Left : p-value = 1
Right : p-value = 5.654227284676502e-39
2-Tail : p-value = 5.654227284676502e-39
The 2-tail p-value is what we're interested in. P stands for probability, and tells us the likelihood of this distribution (or a more extreme one) arising due to chance. In this case, the p-value is roughly 6 chances in a 1 followed by 39 zeroes - not very likely.
However, if you compared the frequency of the word "the" in the two corpora (and if the frequencies are 1,155 and 6,589,967), you would find no significant difference - the odds are roughly 1 in 8 that a result at least this extreme would occur by chance, and this is usually not deemed sufficient for significance. (The usual threshold is 1/20, or p<0.05.)
TABLE = [ 1155 , 15705 , 6589967 , 93918372 ]
Left : p-value = 0.9433046886079023
Right : p-value = 0.06046226289712684
2-Tail : p-value = 0.11971781383139754
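If you would rather script the test than use a web calculator, here is a minimal pure-Python sketch of a two-tailed Fisher's Exact Test for a 2x2 table. The function name fisher_exact_2x2 is made up for illustration. The two-tailed rule it uses (add up the probabilities of every table that is no more likely than the observed one) is a common convention, but it is an assumption that the calculator above works the same way, so the printed values need not match the quoted output digit for digit.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_2x2(a, b, c, d):
    """Two-tailed p-value for the table [[a, b], [c, d]] (illustrative helper)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    log_denominator = log_choose(n, row1)

    def log_prob(x):
        # Hypergeometric log-probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return (log_choose(col1, x) + log_choose(n - col1, row1 - x)
                - log_denominator)

    observed = log_prob(a)
    lowest, highest = max(0, row1 + col1 - n), min(row1, col1)

    p = 0.0
    for x in range(lowest, highest + 1):
        lp = log_prob(x)
        # Small tolerance so that ties with the observed table are counted.
        if lp <= observed + 1e-7:
            p += exp(lp)
    return min(p, 1.0)


for label, cells in [("repudiate", (18, 16840, 450, 100105639)),
                     ("the", (1155, 15705, 6589967, 93918372))]:
    print(label, fisher_exact_2x2(*cells))
```

Working with logarithms keeps the huge BNC counts from overflowing. The arguments are the four cells in the same order as the calculator's TABLE line.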
You can similarly compare different words across corpora. Say relief occurs 29 times in the Lab 2 corpus and 6,757 times in the BNC. Is there a difference in the use of relief and repudiate across the corpora?
|      | repudiate | relief |
| Lab3 | 18        | 29     |
| BNC  | 450       | 6757   |
We get:
TABLE = [ 18 , 29 , 450 , 6757 ]
Left : p-value = 0.9999999999782237
Right : p-value = 2.162780077365129e-10
2-Tail : p-value = 2.162780077365129e-10
Chi-Square works similarly to Fisher's Exact, and should be used in circumstances where: (1) you have large numbers of observations in each cell and (2) you have more than two features or corpora you want to compare across (i.e. a cross table larger than 2x2). Note that Chi-Square cannot tell you which individual differences are significant when you compare more than 2x2, but only whether there's an overall difference from expected values. Online Chi-Square tool: http://www.georgetown.edu/faculty/ballc/webtools/web_chi.html
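To make "difference from expected values" concrete, here is a small sketch that computes the Chi-Square statistic and its degrees of freedom for a table of any size. The 2x3 table is invented for illustration (three words in two corpora), and chi_square is just an illustrative helper name. Turning the statistic into a p-value needs the Chi-Square distribution, which this sketch leaves to a table of critical values or a calculator.

```python
def chi_square(table):
    """Return (statistic, degrees of freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: three words (columns) in two corpora (rows).
table = [[10, 20, 30],
         [20, 20, 20]]

statistic, df = chi_square(table)
print(f"chi-square = {statistic:.3f}, df = {df}")  # chi-square = 5.333, df = 2
```

With two degrees of freedom the critical value at p<0.05 is 5.99, so this invented table would fall just short of significance. As the text says, even a significant result would not say which of the three words is responsible.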
Collocates
Suppose now you want to know about co-occurrence statistics of sets of words. How often does a particular word X occur within a particular span of words from word Y, compared to its likelihood to appear in the corpus as a whole?
I won't bother you with the math, but see a very accessible introduction here: http://www2.lael.pucsp.br/corpora/association/calc.htm
Mutual Information expresses the difference between the observed frequency of Y in the environment of X and the predicted frequency of Y in those same positions based on its frequency in the corpus as a whole as a ratio:
f_observed(X,Y) / f_expected(X,Y)
The Mutual Information value quoted below (8.05) corresponds to the base-2 logarithm of this ratio.
Big
differences indicate massive divergence from the null hypothesis that there is
no effect of X on the occurrence of Y, and indicate that X is exerting a strong
influence over its lexical environment.
But
mutual information fails when frequencies are very low. Suppose that a word,
like aardvark, occurs only
three times in a pretty large corpus, and one of those times, it happens to be
next to goes (as in
"Ethel the Aardvark goes quantity surveying"), not a particularly
unlikely possibility. Thus, it turns out that aardvark occurs much more frequently near goes than would be predicted on the basis of its
frequency in the corpus as a whole. That's why you need a second-order
statistic, which takes frequency into account.
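The aardvark problem is easy to reproduce. The sketch below assumes that Mutual Information is the base-2 log of observed over expected co-occurrences, with the expected count taken as f(X) times f(Y) divided by corpus size and no scaling for window size. Only the 3 occurrences of aardvark come from the text. The corpus size, the frequency of goes, and the function name are invented for illustration.

```python
from math import log2


def mutual_information(f_node, f_collocate, f_joint, corpus_size):
    """log2(observed / expected) co-occurrences; no window scaling (assumption)."""
    expected = f_node * f_collocate / corpus_size
    return log2(f_joint / expected)


# aardvark: 3 occurrences (from the text); all other counts are hypothetical.
mi = mutual_information(f_node=3, f_collocate=500, f_joint=1,
                        corpus_size=1_000_000)
print(round(mi, 2))  # 9.38
```

A single chance co-occurrence produces a very high score, simply because the expected count for such a rare word is tiny.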
The T-score is such a measure. Roughly, it tells you how confident you can be that the association between X and Y is real and not due to the vagaries of chance. A high T-score says it is safe to claim that there is some non-random association between these two words (how safe depends on the value); T-scores are higher when the frequency of Y is higher.
What you will need to compute these two statistics are five things:
· Frequency of node, f(n)
· Frequency of collocate, f(c)
· Frequency of node and collocate within span, f(n,c)
· Window / span / horizon (in words on each side of node)
· Size of corpus, N
So imagine that you're looking at the word relief in the Lab3 corpus and you want to know whether the word effort occurs more frequently in the preceding or following position than would be expected, given the frequency of effort in the corpus in general (11). Then you have:
· f(relief) = 29; f(effort) = 11; f(relief,effort) = 5; Window = 1; N = 16,858
Mutual information and T-score can be calculated using this web interface: http://www2.lael.pucsp.br/corpora/association/calc.htm. Entering the data above tells us:
T-Score: 2.22760545899436 (min. acceptable = 2)
Mutual Information: 8.04566124349092 (min. acceptable = 3)
So there's probably something real going on with relief and effort.
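For the curious, the two scores can be recomputed by hand. The formulas below are an assumption about what the web interface does, not something stated on this page: expected co-occurrences are f(n) times f(c) divided by N, Mutual Information is the base-2 log of observed over expected, and T-score is observed minus expected, divided by the square root of observed. No scaling for window size is applied, which happens to reproduce the quoted values for a window of 1. The function names are illustrative.

```python
from math import log2, sqrt


def expected_cooccurrences(f_node, f_collocate, corpus_size):
    """Co-occurrences expected if node and collocate were independent."""
    return f_node * f_collocate / corpus_size


def mutual_information(f_node, f_collocate, f_joint, corpus_size):
    expected = expected_cooccurrences(f_node, f_collocate, corpus_size)
    return log2(f_joint / expected)


def t_score(f_node, f_collocate, f_joint, corpus_size):
    expected = expected_cooccurrences(f_node, f_collocate, corpus_size)
    return (f_joint - expected) / sqrt(f_joint)


# f(relief) = 29, f(effort) = 11, f(relief,effort) = 5, N = 16,858
print(round(t_score(29, 11, 5, 16858), 4))             # 2.2276
print(round(mutual_information(29, 11, 5, 16858), 4))  # 8.0457
```

Because the T-score divides by the square root of the observed count, a pair seen only once can never score above 1, which is how it guards against the aardvark problem.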
More statistics
There's
a lot more that can be done statistically with corpora, but these are the basic
tools you will be most likely to need. Chapter 3 has a pretty good discussion
of multivariate statistics, and everything you want to know about statistics in
general (or links to it) can be found here.