Ling 431/631: Corpus Linguistics

Ben Bergen

 

Meeting 1: Introduction to corpus linguistics

August 20, 2007

 

Early corpus studies

 

Corpora (though not always called that) were used broadly in language studies before the cognitive revolution of the 1950s:

 

·      The great dictionaries of the 18th century (and beyond) were constructed around large collections (in the millions) of uses of words, e.g. Samuel Johnson's dictionary and the OED

·      Pedagogical uses included building of teaching grammars and translation tools

·      Language documenters who live in the field for a long period of time or who work from oral histories or other texts implicitly use a similar method, though often on a smaller scale and a more ad hoc basis

 

However, working with large bodies of data was extremely time-consuming and inefficient. Moreover, it was hard to collect data in rigorous ways.

 

The cognitive revolution

 

The linguistic revolution led by Chomsky in the 1950s attacked behaviorism and with it the methods of corpus linguistics. The main critiques were:

 

  1. Corpus investigations address performance rather than competence
  2. Frequency tells you about the world rather than about language ("I live in New York" is doubtless more frequent than "I live in Dayton, Ohio")
  3. You can’t know what the circumstances of corpus collection were
  4. A corpus is finite but language is infinite
  5. Corpus research is slow and limited
  6. A corpus leaves out what you don’t say, which can be more informative than what you do say.

 

Some critiques helped the corpus linguistics of the day improve. Others were not valid.

 

  1. Even if we assume a competence-performance distinction (which we might not), performance is still an inherently valid object of study! Moreover, the criticism applies to all language data, because it is all performance data, even grammaticality judgments. In any case, in order to avoid true performance errors, corpora need to be assembled carefully.
  2. World frequency is a good reason to use extremely large and well-balanced corpora (which has been possible since the information revolution).
  3. Corpora are now collected in extremely systematic and controlled ways.
  4. The finite-infinite question is a non-issue: there are infinitely many possible songs, but this doesn't stop us from studying music through actual composed or performed music. It is true that no corpus will ever cover every possible utterance in a given language, so corpora are not sufficient for a complete picture of the human language capacity, but a good 100-million-word corpus will contain many utterances like any one you could come up with.
  5. Corpus linguistics using note cards was slow and limited. Corpus linguistics using concordance software and scripting languages is fast, efficient, and in principle boundless.
  6. Corpus analysis will never tell you that an utterance is impossible. But with a large enough and well enough balanced corpus and sufficient statistical tools, it can tell you when the absence of such an utterance is statistically significant, that is, unlikely to be an accident of sampling. Moreover, corpora can often tell us more about impossible utterances than human judges can. Consider the example cited in the textbook. Chomsky claimed that the verb "perform" cannot be used with mass nouns (you can perform a task but not perform labor), and argued that he knew this as a native speaker of English. But if you check any sizable corpus, you'll find that "perform" can be used with mass nouns: you can perform music, surgery, magic, etc.
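
The reasoning about absence in point 6 can be illustrated with a small calculation. This is only a minimal sketch under a strong simplifying assumption: occurrences of a construction are treated as rare and independent, so the number of tokens in a corpus is approximately Poisson-distributed. The function name and the expected rate of 0.05 per million words are hypothetical, chosen for illustration; in practice the expected rate would itself have to be estimated.

```python
import math

def prob_of_absence(expected_per_million: float, corpus_size: int) -> float:
    """Probability of finding zero tokens of a construction by chance.

    Assumes occurrences are rare and independent, so the count in the
    corpus is roughly Poisson with mean = rate * corpus size.
    """
    expected_count = expected_per_million * corpus_size / 1_000_000
    return math.exp(-expected_count)

# Hypothetical rate: a construction expected 0.05 times per million words.
for size in (1_000_000, 100_000_000):
    p = prob_of_absence(0.05, size)
    print(f"{size:>11,} words: P(absent by chance) = {p:.4f}")
```

Under these assumptions, absence from a 1-million-word corpus is uninformative (chance alone would produce it about 95% of the time), whereas absence from a 100-million-word corpus would arise by chance less than 1% of the time.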

 

Attacks on corpus linguistics continue, but have lost coherence (Chomsky 2004):

 

Corpus linguistics doesn’t mean anything. It’s like saying suppose a physicist decides, suppose physics and chemistry decide that instead of relying on experiments, what they’re going to do is take videotapes of things happening in the world and they’ll collect huge videotapes of everything that’s happening and from that maybe they’ll come up with some generalizations or insights. Well, you know, sciences don’t do this. But maybe they’re wrong. Maybe the sciences should just collect lots and lots of data and try to develop the results from them.[...] So if results come from study of massive data, rather like videotaping what’s happening outside the window, fine - look at the results. I don’t pay much attention to it. I don’t see much in the way of results. [...]

 

(W)e learn more about language by following the standard method of the sciences. The standard method of the sciences is not to accumulate huge masses of unanalyzed data and to try to draw some generalization from them. The modern sciences, at least since Galileo, have been strikingly different. What they have sought to do was to construct refined experiments which ask, which try to answer specific questions that arise within a theoretical context as an approach to understanding the world. [...]

 

There’s no lack of empiricism if you design your inquiries as essentially experiments which seek to discover answers to questions that are arising in a theoretical framework. [...] (Y)ou set up an experiment. An experiment is called work with an informant, in which you design questions that you ask to the informant to elicit data that will bear on the questions that you’re investigating, and will seek to provide evidence that will help you answer these questions that are arising within a theoretical framework. Well, that’s the same kind of thing they do in the physics department or the chemistry department or the biology department. [...]

 

This argument betrays an incorrect grasp of scientific practice and principle:

 

  1. Entire fields of science use exclusively or almost exclusively observational data: astronomy and astrophysics, atmospheric science, archeology, paleontology, subfields of ocean science and biology, etc. In these fields, you (a) observe, (b) build models, which (c) make predictions, and then (d) collect more observations to (e) see how the models do. It's called the scientific method.
  2. By contrast, elicitation with a single subject, no matter how controlled and well designed the questions are, fails as a viable experiment: (a) it has a sample size of 1 (valid only if the population of language speakers is approximately 1, in which case you are studying the entire population); (b) it asks for judgments, which are valid only if judgments are what you are studying (much like studying a car's engine by watching only the engine light); (c) it is an extremely impoverished method, as it can tell you little if anything about child language, learners’ speech, speech errors, casual speech, social variables, frequencies, etc.

 

Corpus linguistics provides a viable, practical, and informative set of tools for a broad range of scientific and other linguistic purposes.

 

The technological revolution

 

Corpus research has been bolstered immensely by the growing pervasiveness and power of computational systems and electronic bodies of text. A modern corpus has certain basic properties:

 

  1. Sampling and representativeness. The texts in a corpus must be collected in a systematic way, under controlled conditions, and in such a way that the corpus reflects the true distribution of the language/dialect/variety under study.
  2. Machine readability. Originally, corpora were prepared by hand, on index cards placed in filing cabinets, but to scale up and save time, a corpus should be electronic and properly annotated.
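
To show what machine readability buys in practice, here is a minimal sketch of a keyword-in-context (KWIC) search, the basic operation that concordance software performs. It is not the implementation of any particular concordancer: the function name `kwic`, the crude tokenization, and the toy text are all assumptions made for illustration.

```python
import re

def kwic(text: str, keyword: str, width: int = 3) -> list[str]:
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    # Crude tokenization: lowercase the text and keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}")
    return lines

# Toy text (invented for illustration, not drawn from any real corpus).
sample = "Surgeons perform surgery. Musicians perform music, and magicians perform magic."
for line in kwic(sample, "perform"):
    print(line)
```

A search that would take hours with index cards runs in well under a second here, and the same loop works unchanged on a text of millions of words.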

 

What a corpus is like beyond this will depend on what it's being used for. Corpora can vary along a number of dimensions:

 

1.     Type of data (acoustic recordings, transcribed speech, written language)

2.     Type of speakers (children, adults, first or second language users)

3.     Languages or varieties (just one language or variety, or several, for comparison)

4.     Size (small corpora can be used to answer limited types of questions, and may be as small as tens of thousands of words; large corpora can be used for a broader range of purposes and may be as large as millions or hundreds of millions of words)

5.     Time (synchronic or diachronic)
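
Because corpora differ so much in size, raw counts from two corpora cannot be compared directly. A common remedy is to normalize counts to a rate per million words. The sketch below illustrates the arithmetic; the function name and the counts are hypothetical, not taken from any of the corpora discussed here.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalize a raw count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size

# Hypothetical counts for the same word in two corpora of different sizes.
small = per_million(raw_count=12, corpus_size=50_000)
large = per_million(raw_count=9_000, corpus_size=100_000_000)
print(f"small corpus: {small:.1f} per million")
print(f"large corpus: {large:.1f} per million")
```

In this invented case the word has far fewer raw hits in the small corpus (12 versus 9,000) but a higher normalized rate (240 versus 90 per million).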

 

Resources at our disposal

 

There are many corpus resources available online, importantly including CHILDES (for first and second language acquisition). There will be a list of links at the course website soon. The web can also be used as a resource, within limits, as we will see on Thursday.

 

Corpora that we have copies of in the LAE General Lab include those below. To use the labs, you need to sign up as a lab member. If you are not yet a member, you can start this process by following the instructions on the 'for lab users' link here: http://www.ling.hawaii.edu/lae/. (All students and faculty in the college of LLL are eligible; others can be given special dispensation if enrolled in this or other relevant courses.)

 

 

 


LAE Labs corpora

 

Name | Language | Type
--- | --- | ---
Boston University Radio Speech Corpus | English | Microphone speech
CALLHOME American English Speech | English | Telephone speech
CALLHOME American English Transcripts | English | Conversation text
CELEX2 | English, Dutch, German | Varied lexicon
DSO Corpus of Sense-tagged English | English | Varied text
TIMIT Acoustic-Phonetic Continuous Speech Corpus | English | Microphone speech
Treebank-2 | English | Varied text
Treebank-3 | English | Varied text
Chinese Treebank Final Release | Chinese | Newswire text
CALLHOME Japanese Speech | Japanese | Telephone speech
CALLHOME Japanese Transcripts | Japanese | Conversation text
Japanese Business News Text | Japanese | Newswire text
Korean Newswire | Korean | Newswire text
1998 HUB5 English Evaluation | English | Telephone speech
Switchboard-2 Phase III Audio | English | Telephone speech
British National Corpus (BNC) World Edition | English | Varied speech and text
Child Language Data Exchange System (CHILDES) 2001 | Various languages | Transcripts of naturalistic speech of children
Corpus of Spoken Professional American English, tagged version | English | Conversation text
ICAME (International Computer Archive of Modern and Medieval English) | English | Written, spoken, historical, and parsed corpora
ICE-GB Sample (The British Component of the International Corpus of English) | English | Varied text
IViE (Intonational Variation in English) | English | Varied speech
ToBI Corpora | English, Japanese, Korean | Varied speech
UCLA Speech Error Corpus | Various languages | Database
ANC First Release | English (American) |
CALLHOME German Speech | German |
Chinese Gigaword | Chinese |
Korean Telephone Conversations Speech | Korean |
SLX Corpus of Classic Sociolinguistic Interviews | English |
21st Century SEJONG Project | Korean |