Ling 431/631: Corpus Linguistics
Ben Bergen
Meeting 10: CHILDES
October 22, 2007
CHILDES
The CHIld Language Data Exchange System is a free online collection of transcribed interactions between children and adults, intended to be shared for child language acquisition research.
The collection:
- Lots of English, but also data from children learning Mandarin, Cantonese, Korean, Japanese, Basque, Estonian, Farsi, Greek, Hebrew, Sesotho, etc. [Researchers are encouraged to contribute, so the collection is constantly in flux, and the list in the paper is outdated.]
- Archives can include just transcripts, or may also have audio and/or video.
- The whole database is available online, or can be downloaded. We have some in the LAE Labs.
- In addition to child language, it now includes adult language under the rubric of TalkBank.
Transcripts contributed to CHILDES are encoded using the CHAT notational scheme [Codes for the Human Analysis of Transcripts], which includes tags for headers, speakers, and annotations. The sample from the article:
@Begin
@Languages:	en
@Participants:	ROS Ross Child, BRI Brian Father
*ROS:	why isn't Mommy coming?
%com:	Mother usually picks Ross up around 4PM
*BRI:	don't worry.
*BRI:	she'll be here soon.
*ROS:	good.
@End
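You can probably figure out what the special characters are for: @ marks a header line, * marks a speaker's utterance (the main tier), and % marks a dependent tier attached to the preceding utterance, here a comment.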
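The sample can also be read mechanically. Here is a minimal sketch in Python, assuming each line is classified by its first character and that everything up to the first colon is the tag. The names `Utterance` and `parse_chat` are made up for illustration and are not part of CHAT or CLAN; real CHAT files also have tab-indented continuation lines, which this sketch simply ignores.

```python
from dataclasses import dataclass, field

# The sample transcript from the handout, with tabs after the colons.
SAMPLE = """\
@Begin
@Languages:\ten
@Participants:\tROS Ross Child, BRI Brian Father
*ROS:\twhy isn't Mommy coming?
%com:\tMother usually picks Ross up around 4PM
*BRI:\tdon't worry.
*BRI:\tshe'll be here soon.
*ROS:\tgood.
@End
"""


@dataclass
class Utterance:
    """One main-tier line plus any dependent tiers that follow it."""
    speaker: str
    text: str
    tiers: dict = field(default_factory=dict)


def parse_chat(text):
    """Split a CHAT-style transcript into headers and utterances.

    Simplification (an assumption of this sketch): continuation lines
    and any line not starting with @, *, or % are skipped.
    """
    headers = []
    utterances = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Everything before the first colon is the tag, e.g. "*ROS" or "%com".
        marker, _, rest = line.partition(":")
        rest = rest.strip()
        if line.startswith("@"):
            headers.append((marker[1:], rest))
        elif line.startswith("*"):
            utterances.append(Utterance(marker[1:], rest))
        elif line.startswith("%") and utterances:
            # A dependent tier belongs to the most recent utterance.
            utterances[-1].tiers[marker[1:]] = rest
    return headers, utterances


headers, utterances = parse_chat(SAMPLE)
for u in utterances:
    print(u.speaker, "|", u.text, "|", u.tiers)
```

Attaching each % line to the utterance before it mirrors how the comment in the sample describes Ross's question rather than standing on its own.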
In addition, there is the usual array of more advanced features you would like to have in a corpus:
- File headers: a set of 24 standard file headers such as "Age of Child," "Birth of Child," "Participants," "Location," and "Date" that document a variety of facts about the participants and the recording.
- Tone units: a system for marking tone units, pauses, and contours.
- Scoping: a scoping convention to indicate stretches of overlaps, metalinguistic reference, retracings, and other complex patterns.
- Dependent tiers: definitions for 14 coding tiers. Coding for three of these dependent tiers has been worked out in detail.
- Phonological coding: uses Unicode to allow for direct input of IPA phonological characters.
- Error coding: provides a full system for coding speech errors.
- Morphemic coding: a system for morphemic and syntactic coding or interlinear glossing.
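To make the file headers concrete, here is a small sketch in Python that collects the @ headers of a transcript into a dictionary and unpacks the "Participants" header into speaker codes, names, and roles. It uses only the two headers that appear in the sample from the article. The function names `read_headers` and `parse_participants` are hypothetical, and the assumed layout of each participant entry (code first, role last, optional name in between, entries separated by commas) is a simplification.

```python
# Header lines taken from the sample transcript in the handout.
SAMPLE_HEADERS = """\
@Begin
@Languages:\ten
@Participants:\tROS Ross Child, BRI Brian Father
@End
"""


def read_headers(text):
    """Map header names to values for every '@Name: value' line."""
    headers = {}
    for line in text.splitlines():
        if line.startswith("@") and ":" in line:
            key, _, value = line[1:].partition(":")
            headers[key.strip()] = value.strip()
    return headers


def parse_participants(value):
    """Turn 'ROS Ross Child, BRI Brian Father' into a dict keyed by code.

    Assumption of this sketch: the first word is the speaker code, the
    last word is the role, and anything in between is the name.
    """
    participants = {}
    for entry in value.split(","):
        parts = entry.split()
        if not parts:
            continue
        code = parts[0]
        role = parts[-1] if len(parts) > 1 else ""
        name = " ".join(parts[1:-1])
        participants[code] = {"name": name, "role": role}
    return participants


headers = read_headers(SAMPLE_HEADERS)
participants = parse_participants(headers["Participants"])
print(headers["Languages"])   # en
print(participants["BRI"])    # {'name': 'Brian', 'role': 'Father'}
```

Once the speaker codes are known, an analysis can be restricted to the child or to the adults, which is typically the first step in an acquisition study.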
The CLAN system [Computerized Language ANalysis] is a suite of tools for analyzing language data tagged in the CHAT format. It can be downloaded from the same site.
It includes functions for:
- Calculating frequency of words or morphemes [FREQ]
- Calculating mean length of utterance [MLU]
- Finding combinations of words in the same utterance [COMBO]
- Producing concordances [KWAL]
- And many others
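To give a feel for what these functions compute, here is a rough sketch in Python of the four listed above, run over the utterances from the sample transcript. These are not the CLAN programs and do not reproduce their options or output. The tokenizer is a simple assumption, and `mlu_words` counts words, whereas MLU is usually reported in morphemes, which requires morphemic coding. All function names are made up for illustration.

```python
import re
from collections import Counter

# (speaker, utterance) pairs from the sample transcript in the handout.
UTTERANCES = [
    ("ROS", "why isn't Mommy coming?"),
    ("BRI", "don't worry."),
    ("BRI", "she'll be here soon."),
    ("ROS", "good."),
]


def tokenize(text):
    """Lowercase and keep runs of letters and apostrophes (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())


def freq(utterances, speaker=None):
    """Word frequencies, optionally restricted to one speaker (cf. FREQ)."""
    counts = Counter()
    for who, text in utterances:
        if speaker is None or who == speaker:
            counts.update(tokenize(text))
    return counts


def mlu_words(utterances, speaker):
    """Mean length of utterance in words for one speaker (cf. MLU)."""
    lengths = [len(tokenize(t)) for who, t in utterances if who == speaker]
    return sum(lengths) / len(lengths) if lengths else 0.0


def combo(utterances, words):
    """Utterances containing all of the given words (cf. COMBO)."""
    wanted = {w.lower() for w in words}
    return [(who, t) for who, t in utterances if wanted <= set(tokenize(t))]


def kwal(utterances, keyword):
    """Utterances containing the keyword, shown with their speaker (cf. KWAL)."""
    keyword = keyword.lower()
    return [(who, t) for who, t in utterances if keyword in tokenize(t)]


print(freq(UTTERANCES, "ROS")["mommy"])   # 1
print(mlu_words(UTTERANCES, "ROS"))       # 2.5
print(mlu_words(UTTERANCES, "BRI"))       # 3.0
print(combo(UTTERANCES, ["be", "soon"]))  # [('BRI', "she'll be here soon.")]
print(kwal(UTTERANCES, "worry"))          # [('BRI', "don't worry.")]
```

Treating "isn't" and "she'll" as single words is one of many tokenization choices that affect the counts, so numbers from a script like this should not be compared directly with CLAN output.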
Strengths
- Integrated and powerful environment
- Affords lots of different types of analysis
Weaknesses
- Requires that data be coded using the CHAT notation
- Uses a command-line interface
- Is not as open-ended as self-written tools can be