Ling 431/631: Corpus Linguistics

Ben Bergen

 

Meeting 4: Quantitative Methods

September 10, 2007

 

Preliminaries

 

Questions about Lab 2?

 

One terminological/conceptual issue before we move on: the distinction between tokens and types.

·          Types are categories and tokens are instances of the categories.

·          For instance, this sentence has fourteen word tokens, which represent thirteen different word types.

·          Types can be graphical word forms (like repudiates), lexemes (like repudiate - which would include repudiates, repudiating, etc.), parts of speech, etc.
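As a rough illustration, the token and type counts claimed for the example sentence above can be checked in a few lines of Python. This is only a sketch: the regular-expression tokenizer and the decision to treat types as lowercased graphical word forms are assumptions made for the example, not a prescribed method.

```python
import re

def tokens_and_types(text):
    """Return (token count, type count) for graphical word forms."""
    # Assumption: a token is a run of letters or apostrophes, lowercased.
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(tokens), len(set(tokens))

sentence = ("For instance, this sentence has fourteen word tokens, "
            "which represent thirteen different word types.")
n_tokens, n_types = tokens_and_types(sentence)
print(n_tokens, n_types)  # 14 13
```

Counting lexemes or parts of speech as types would need a lemmatizer or tagger in place of the simple set of word forms.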

 

Today: what sorts of statistical tools are used to quantitatively analyze a corpus?

 

Descriptive statistics

 

Descriptive statistics are used to summarize collections of data in clear and understandable ways.

 

·          Common descriptive statistics: mode, median, variance, range, standard deviation.

·          Descriptive statistics can't tell you the likelihood that the tendencies in your data are due to chance.

o        If you took this sentence as a corpus, you would find that the word corpus occurs about one time out of every seventeen words, and is more frequent than a, I, or are.

o        You probably share the intuition that this trend does not generalize to English in general, but a frequency count alone can't tell us this.
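The descriptive statistics listed above can all be computed with Python's standard statistics module. The sketch below uses made-up sentence lengths purely for illustration; the data are hypothetical and do not come from any corpus discussed here.

```python
import statistics

# Hypothetical data: sentence lengths (in words) from a small sample.
sentence_lengths = [12, 15, 15, 18, 22, 9, 15]

summary = {
    "mode": statistics.mode(sentence_lengths),
    "median": statistics.median(sentence_lengths),
    "range": max(sentence_lengths) - min(sentence_lengths),
    "variance": statistics.variance(sentence_lengths),  # sample variance
    "std_dev": statistics.stdev(sentence_lengths),      # sample standard deviation
}
for name, value in summary.items():
    print(f"{name}: {value}")
```

These numbers summarize the sample itself; none of them says whether a pattern would hold in a larger population.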

 

Inferential statistics

 

Inferential statistics: tests you apply to quantitative data to determine the likelihood that the results you observe are due to chance, and thus whether or not they can be generalized to the larger population from which your sample was drawn.

 

One of the most common questions one asks when looking at a corpus is whether frequency differences are statistically significant - that is, whether the likelihood that they arose due to chance is below a predetermined threshold of acceptability.

 

E.g., you are looking at two texts and want to know whether a particular word is significantly more frequent in one than in the other. For example, compare the frequency of variants of repudiate in the corpus from Lab 2 (18/16,858) with that in the BNC (450/100,106,089). The proportion seems much higher in the Lab 2 corpus, and you can set up the data in a contingency table, like so (notice that it's the raw frequencies, and not ratios, that are placed in these tables):

 

                 

Corpus     repudiate     other words
Lab 2             18          16,840
BNC              450     100,105,639

In order to test whether this difference could be due to chance, we use either the Chi-Square test or Fisher's Exact Test. Chi-Square is used more widely, but it's actually less accurate when you have small numbers in any cell. So Fisher's Exact is preferred (as long as you have only four cells). Using an online Fisher's Exact calculator (http://www.matforsk.no/ola/fisher.htm), we get:

 

TABLE = [ 18 , 16840 , 450 , 100105639 ]

Left   : p-value = 1

Right  : p-value = 5.654227284676502e-39

2-Tail : p-value = 5.654227284676502e-39

 

The 2-tail p-value is what we're interested in. P stands for probability, and tells us the likelihood of this distribution (or a more extreme one) arising due to chance. In this case, the p-value is about 5.7 x 10^-39, or roughly one in 2 followed by 38 zeroes - not very likely.
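For readers who would rather compute the test than use a web form, here is a minimal sketch of Fisher's Exact Test for a 2x2 table using only the Python standard library. It is not the online calculator's code: the function name fisher_exact and the definition of the two-tailed value (the sum of all tables no more probable than the observed one) are assumptions made for this example, and the result need not match the calculator's figure digit for digit.

```python
from fractions import Fraction
from math import comb

def fisher_exact(a, b, c, d):
    """Return (left, right, two_tail) p-values for the table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    lowest = max(0, col1 - (total - row1))
    highest = min(row1, col1)
    # Unnormalized hypergeometric weight of each possible top-left count k.
    weights = {k: comb(row1, k) * comb(total - row1, col1 - k)
               for k in range(lowest, highest + 1)}
    denominator = comb(total, col1)
    observed = weights[a]
    left = sum(w for k, w in weights.items() if k <= a)
    right = sum(w for k, w in weights.items() if k >= a)
    # Assumption: two-tailed = all tables no more probable than the observed one.
    two_tail = sum(w for w in weights.values() if w <= observed)
    return tuple(float(Fraction(s, denominator)) for s in (left, right, two_tail))

# The repudiate table from the text: [18, 16840, 450, 100105639].
left, right, two_tail = fisher_exact(18, 16840, 450, 100105639)
print(left, right, two_tail)
```

Exact integer arithmetic keeps the sketch simple, but it becomes slow when both the row and the column totals are very large, so a statistics package is the more practical choice for tables like that.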

 

However, if you compared the frequency of the word the in the two corpora (and if the frequencies are 1,155 and 6,589,967), you would find no significant difference - the odds are about 1 in 8 (p = 0.12) that a result at least this extreme would occur by chance, and this is usually not deemed sufficient for significance. (The usual threshold is 1/20, or p < 0.05.)

 

TABLE = [ 1155 , 15705 , 6589967 , 93918372 ]

Left   : p-value = 0.9433046886079023

Right  : p-value = 0.06046226289712684

2-Tail : p-value = 0.11971781383139754

 

You can similarly compare different words across corpora. Say relief occurs 29 times in the Lab 2 corpus and 6,757 times in the BNC. Is there a difference in the use of relief and repudiate across the corpora?

 

 

Corpus     repudiate     relief
Lab 2             18         29
BNC              450      6,757

 

We get:

TABLE = [ 18 , 29 , 450 , 6757 ]

Left   : p-value = 0.9999999999782237

Right  : p-value = 2.162780077365129e-10

2-Tail : p-value = 2.162780077365129e-10

 

Chi-Square works similarly to Fisher's Exact, and should be used in circumstances where (1) you have large numbers of observations in each cell and (2) you have more than two features or corpora you want to compare across (i.e. a cross table larger than 2x2). Note that Chi-Square cannot tell you which individual differences are significant when the table is larger than 2x2, but only whether there's an overall difference from expected values. An online Chi-Square calculator: http://www.georgetown.edu/faculty/ballc/webtools/web_chi.html
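To make the "difference from expected values" concrete, the following sketch computes the Pearson Chi-Square statistic for a table of raw counts of any size. The function names are invented for this example, and the p-value helper covers only the 2x2 case (one degree of freedom); for larger tables the statistic would have to be looked up in a Chi-Square table or passed to a statistics package.

```python
import math

def chi_square(table):
    """Return (statistic, degrees of freedom) for a table of raw counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

def p_value_2x2(statistic):
    """Upper-tail p-value, valid for one degree of freedom only."""
    return math.erfc(math.sqrt(statistic / 2))

# The table for "the" from the text: [1155, 15705, 6589967, 93918372].
the_table = [[1155, 15705], [6589967, 93918372]]
statistic, dof = chi_square(the_table)
print(statistic, dof, p_value_2x2(statistic))
```

With counts this large in every cell, the Chi-Square p-value should land close to the two-tailed Fisher's Exact value reported above.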

 

 

Collocates

 

Suppose now you want to know about co-occurrence statistics of sets of words. How often does a particular word X occur within a particular span of words from word Y, compared to its likelihood of appearing in the corpus as a whole?

 

I won't bother you with the math, but see a very accessible introduction here: http://www2.lael.pucsp.br/corpora/association/calc.htm

 

Mutual Information expresses the difference between the observed frequency of Y in the environment of X and the predicted frequency of Y in those same positions based on its frequency in the corpus as a whole as a ratio:

 

f(X,Y)observed / f(X,Y)expected

(The score is usually reported as the base-2 logarithm of this ratio.)

 

A large ratio indicates a strong divergence from the null hypothesis that X has no effect on the occurrence of Y, and suggests that X is exerting a strong influence over its lexical environment.

 

But mutual information fails when frequencies are very low. Suppose that a word, like aardvark, occurs only three times in a pretty large corpus, and one of those times, it happens to be next to goes (as in "Ethel the Aardvark goes quantity surveying"), not a particularly unlikely possibility. Thus, it turns out that aardvark occurs much more frequently near goes than would be predicted on the basis of its frequency in the corpus as a whole. That's why you need a second-order statistic, which takes frequency into account.

 

The T-score is such a measure. Roughly, it tells you how confident you can be that the association between X and Y is real and not due to the vagaries of chance. A high T-score says it is safe (pretty safe, very safe, or extremely safe, depending on the value) to claim that there is some non-random association between the two words. T-scores are higher when the frequency of Y is higher.

To compute these two statistics you will need five things:

·          Frequency of node, f(n)

·          Frequency of collocate, f(c)

·          Frequency of node and collocate within span, f(n,c)

·          Window / span / horizon (in words on each side of node)

·          Size of corpus, N
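The first three quantities and N can be counted directly from a tokenized text. The sketch below shows one way to do it; the function name, the convention of counting every collocate token that falls inside the window around any node token, and the toy sentence are all assumptions made for illustration.

```python
def collocation_counts(tokens, node, collocate, window=1):
    """Return f(n), f(c), f(n,c) and N for a list of word tokens."""
    f_n = tokens.count(node)
    f_c = tokens.count(collocate)
    f_nc = 0
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Words up to `window` positions to the left and right of the node.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        f_nc += (left + right).count(collocate)
    return {"f_n": f_n, "f_c": f_c, "f_nc": f_nc, "N": len(tokens)}

# Made-up toy text, not taken from either corpus.
toy = "the relief effort and the effort for relief continued as relief arrived".split()
print(collocation_counts(toy, "relief", "effort", window=1))
# {'f_n': 3, 'f_c': 2, 'f_nc': 1, 'N': 12}
```

Widening the window to 2 also picks up the second effort, which sits two words before the second relief.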

 

So imagine that you're looking at the word relief in the Lab 2 corpus and you want to know whether the word effort occurs immediately before or after it more frequently than would be expected, given the frequency of effort in the corpus in general (11). Then you have:

 

·          f(relief) = 29; f(effort) = 11; f(relief,effort) = 5; Window = 1; N = 16,858

 

Mutual information and T-score can be calculated using this web interface: http://www2.lael.pucsp.br/corpora/association/calc.htm. Entering the data above tells us:

 

T-Score: 2.22760545899436 (min. acceptable = 2)

Mutual Information: 8.04566124349092 (min. acceptable = 3)

 

So there's probably something real going on with relief and effort.
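The two scores can also be computed by hand. The sketch below assumes the common textbook formulas: expected co-occurrences E = f(n) x f(c) / N, Mutual Information = log2(observed / E), and T-score = (observed - E) / sqrt(observed). These formulas, and the choice not to scale E by the window size, are assumptions rather than a description of the web interface's code, although they do reproduce the relief/effort figures quoted above.

```python
import math

def association_scores(f_n, f_c, f_nc, n_words):
    """Return (mutual information, t-score) for a node/collocate pair."""
    # Assumption: expected co-occurrences = f(n) * f(c) / N, with no
    # extra scaling for the window size.
    expected = f_n * f_c / n_words
    mutual_information = math.log2(f_nc / expected)
    t_score = (f_nc - expected) / math.sqrt(f_nc)
    return mutual_information, t_score

mi, t = association_scores(f_n=29, f_c=11, f_nc=5, n_words=16858)
print(f"MI = {mi:.3f}, t = {t:.3f}")  # MI = 8.046, t = 2.228

# Hypothetical low-frequency pair: aardvark (3 tokens) occurring once next
# to goes (1,000 tokens) in a corpus of 1,000,000 words.
mi_low, t_low = association_scores(f_n=3, f_c=1000, f_nc=1, n_words=1_000_000)
print(f"MI = {mi_low:.3f}, t = {t_low:.3f}")  # high MI, but t well below 2
```

The second call illustrates the aardvark problem: a single chance co-occurrence yields a Mutual Information score above the threshold of 3, while the T-score stays below 2.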

 

More statistics

 

There's a lot more that can be done statistically with corpora, but these are the basic tools you will be most likely to need. Chapter 3 has a pretty good discussion of multivariate statistics, and everything you want to know about statistics in general (or links to it) can be found here:

 

http://davidmlane.com/hyperstat/index.html