Ling 431/631: Corpus Linguistics
Ben Bergen
Meeting 1: Introduction to corpus linguistics
August 20, 2007
Early corpus studies
Corpora (though not always called that) were used broadly in language studies before
the cognitive revolution in the 1950s:
· The great dictionaries of the 18th century (and beyond) were constructed around
large collections of attested uses of words (running into the millions for the OED),
e.g. Samuel Johnson's dictionary and the OED
· Pedagogical uses included building teaching grammars and translation tools
· Language documenters who live in the field for a long period of time, or who work
from oral histories or other texts, implicitly use a similar method, though often on
a smaller scale and on a more ad hoc basis
However, working with large bodies of data was extremely time-consuming and inefficient.
Moreover, it was hard to collect data in rigorous ways.
The cognitive revolution
The linguistic revolution led by Chomsky in the 1950s attacked behaviorism and with
it the methods of corpus linguistics. The main critiques were:
Some critiques helped the corpus linguistics of the day improve. Others were not valid.
Attacks on corpus linguistics continue, but have lost coherence (Chomsky 2004):
Corpus linguistics doesn’t mean anything. It’s like saying suppose a physicist decides, suppose
physics and chemistry decide that instead of relying on experiments, what
they’re going to do is take videotapes of things happening in the world and
they’ll collect huge videotapes of everything that’s happening and from that
maybe they’ll come up with some generalizations or insights. Well, you know,
sciences don’t do this. But maybe they’re wrong. Maybe the sciences should just
collect lots and lots of data and try to develop the results from them. [...] So
if results come from study of massive data, rather like videotaping what’s
happening outside the window, fine - look at the results. I don’t pay much
attention to it. I don’t see much in the way of results. [...]
(W)e learn more about language by following the standard method of the sciences. The standard
method of the sciences is not to accumulate huge masses of unanalyzed data and
to try to draw some generalization from them. The modern sciences, at least
since Galileo, have been strikingly different. What they have sought to do was
to construct refined experiments which ask, which try to answer specific
questions that arise within a theoretical context as an approach to
understanding the world. [...]
There’s no lack of empiricism if you design your inquiries as essentially experiments which
seek to discover answers to questions that are arising in a theoretical
framework. [...] (Y)ou set up an experiment. An experiment is called work with an
informant, in which you design questions that you ask to the informant to
elicit data that will bear on the questions that you’re investigating, and will
seek to provide evidence that will help you answer these questions that are
arising within a theoretical framework. Well, that’s the same kind of thing
they do in the physics department or the chemistry department or the biology
department. [...]
This argument betrays an incorrect grasp of scientific practice and principle:
Corpus linguistics provides a viable, practical, and informative set of tools for a
broad range of scientific and other linguistic purposes.
The technological revolution
Corpus research has been bolstered immensely by the growing pervasiveness and power of
computational systems and electronic bodies of text. A modern corpus has
certain basic properties.
What a corpus is like beyond this will depend on what it's being used for. Corpora
can vary along a number of dimensions:
1. Type of data (acoustic recordings, transcribed
speech, written language)
2. Type of speakers (children, adults, first or
second language users)
3. Languages or varieties (just one language or
variety, or several, for comparison)
4. Size (small corpora can answer only limited types of questions, and may be as
small as tens of thousands of words; large corpora support a broader range of uses
and may run to millions or hundreds of millions of words)
5. Time (synchronic or diachronic)
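One way to make these dimensions concrete is to record them as metadata for each corpus and then filter on them. The following is only a minimal sketch in Python: the class name `CorpusProfile`, the function `select`, and the three toy entries are all hypothetical and invented for illustration. They do not describe any real corpus or any existing tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CorpusProfile:
    """Hypothetical record describing one corpus along the five dimensions."""
    name: str
    data_type: str               # e.g. "acoustic", "transcribed speech", "written"
    speakers: str                # e.g. "children", "adults", "second language users"
    languages: Tuple[str, ...]   # one language or variety, or several for comparison
    size_in_words: int           # rough word count
    diachronic: bool             # False means a synchronic snapshot


def select(corpora: List[CorpusProfile], *,
           language: Optional[str] = None,
           data_type: Optional[str] = None,
           min_words: int = 0) -> List[CorpusProfile]:
    """Return the corpora that satisfy every criterion that was supplied."""
    result = []
    for corpus in corpora:
        if language is not None and language not in corpus.languages:
            continue
        if data_type is not None and corpus.data_type != data_type:
            continue
        if corpus.size_in_words < min_words:
            continue
        result.append(corpus)
    return result


# Toy entries, invented for illustration only.
EXAMPLES = [
    CorpusProfile("toy-child-speech", "transcribed speech", "children",
                  ("English",), 40_000, False),
    CorpusProfile("toy-newswire", "written", "adults",
                  ("Korean",), 200_000_000, False),
    CorpusProfile("toy-letters", "written", "adults",
                  ("English", "Dutch"), 3_000_000, True),
]

if __name__ == "__main__":
    # Which English corpora have at least a million words?
    for corpus in select(EXAMPLES, language="English", min_words=1_000_000):
        print(corpus.name)
```

A record like this makes the point of the list explicit: whether a corpus suits a research question depends on where it sits on each dimension, not on size alone.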
There are many corpus resources available online, notably CHILDES (for
first and second language acquisition). A list of links will be posted on the
course website soon. The web can also be used as a resource, within limits, as
we will see on Thursday.
The corpora that we have copies of in the LAE General Lab include those below. To use
the labs, you need to sign up as a lab member. If you are not yet a member, you can
start this process by following the instructions on the 'for lab users' link
here: http://www.ling.hawaii.edu/lae/.
(All students and faculty in the College of LLL are eligible; others can be
given special dispensation if enrolled in this or other relevant courses.)
LAE Labs corpora
| Name | Language | Type |
|  | English | Microphone speech |
|  | English | Telephone speech |
|  | English | Conversation text |
|  | English, Dutch, German | Varied lexicon |
|  | English | Varied text |
|  | English | Microphone speech |
|  | English | Varied text |
|  | English | Varied text |
|  | Chinese | Newswire text |
|  | Japanese | Telephone speech |
|  | Japanese | Conversation text |
|  | Japanese | Newswire text |
|  | Korean | Newswire text |
|  | English | Telephone speech |
|  | English | Telephone speech |
|  | English | Varied speech and text |
|  | Various languages | Transcripts of naturalistic speech of children |
| Corpus of Spoken Professional American English tagged version | English | Conversation text |
| ICAME (International Computer Archive of Modern and Medieval English) | English | Written, spoken, historical, and parsed corpora |
| ICE-GB Sample (The British Component of the International Corpus of English) | English | Varied text |
|  | English | Varied speech |
|  |  | Varied speech |
|  | Various languages | Database |
| ANC First Release | English (American) |  |
| CALLHOME German Speech | German |  |
| Chinese Gigaword | Chinese |  |
| Korean Telephone Conversations Speech | Korean |  |
| SLX Corpus of Classic Sociolinguistic Interviews | English |  |
| 21st Century SEJONG Project | Korean |  |