Corpus Linguistics
Linguistics 431 & 631
Ben Bergen
Mon, Wed 3:30-4:45
This course is an introduction to the use of corpora in the study of language. A corpus is a large collection of language data that can be the empirical basis for a broad range of applications, like:
• Describing languages (whether they be well or under-documented)
• Testing predictions made by linguistic theories (syntactic, morphological, phonetic, etc.)
• Balancing stimuli for use in experiments
But certain expertise is required to interact fruitfully with a corpus.
• What corpus resources exist, how many we have access to at UH (answer: a lot!), and where they can be found and used.
• What different types of corpus exist, including full acoustic records, part-of-speech tagged corpora, syntactically parsed corpora, sociolinguistically tagged corpora, etc.
• If there isn't a corpus for a language (or a question), you need to build one. But what can go into a corpus, and how do you make sure it will allow you to do the things you want with it?
• Different tools for searching through corpora allow you to investigate different sorts of questions - everything from the frequency of individual words or morphemes to the range of noun phrases that can be the subjects of a particular class of verbs. What search tools allow you to investigate what sorts of questions, and how do you use those tools?
• Once you have searched through a corpus, you need to apply tests to tell if what you found is statistically significant. Which ones are appropriate when, and how do you perform them?
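As a rough illustration of the last two points, here is a minimal Python sketch that counts word frequencies in two toy texts and then computes a chi-square statistic for one word. The texts, the function names, and the simple regular-expression tokenizer are all assumptions made for the example; they are not course materials or a prescribed method.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens; the regex tokenizer is a deliberate simplification."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table [[a, b], [c, d]].

    Raises ZeroDivisionError if any row or column total is zero.
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Two toy "corpora", invented purely for illustration.
corpus_a = "the cat sat on the mat and the dog sat on the rug"
corpus_b = "a cat chased a dog while a bird watched from a tree"

freq_a = word_frequencies(corpus_a)
freq_b = word_frequencies(corpus_b)

# Build the 2x2 table: target word vs. all other words, corpus A vs. corpus B.
target = "the"
in_a, in_b = freq_a[target], freq_b[target]
other_a = sum(freq_a.values()) - in_a
other_b = sum(freq_b.values()) - in_b

statistic = chi_square_2x2(in_a, other_a, in_b, other_b)

# With 1 degree of freedom, 3.84 is the critical value at p = 0.05.
print(target, in_a, in_b, round(statistic, 2), statistic > 3.84)
```

With counts this small the chi-square approximation is not reliable, and an exact test would usually be the better choice; which test fits which situation is exactly the kind of question the bullet above raises.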
The approach is extremely hands-on. The typical week will include a seminar-style meeting on Monday and an applied lab meeting on Wednesday. Students will learn both theory about the construction and use of corpora and also the applied details of how to use existing corpora and corpus search tools, with special emphasis on those we have available in the LAE labs, which include English, Chinese, Japanese, German, and Korean corpora.
This course is intended for students with little computer knowledge. It is open to advanced undergraduate students and graduate students with an interest in language, in any discipline. No programming experience is required.
Students have four responsibilities:
Class participation entails not mere physical presence and production of sound, but engaged, constructive, critical, informed, and concise contributions to class discussion. This can be done orally, in class, or by email, if you're less comfortable speaking up. All course meetings are mandatory. I understand that things come up, so in the absence of catastrophic events, up to three absences are excusable (though students are still responsible for the content of missed meetings). Additional absences will decrease the participation grade.
Each lab meeting will be structured around a lab assignment. The aim is for the fastest students to be able to complete these during scheduled class time, but they will likely often take longer for more slow-and-steady workers. I will of course be available during the lab to answer any questions you have, and you are also encouraged to work collaboratively: while each student should execute all the required operations himself or herself, I suggest you discuss particularly useful solutions, strategies, etc. with your neighbors. These lab assignments will be collected one week after they're assigned, and together will form a sort of personalized instruction manual for corpus linguistics.
Coursework will culminate in a research project. This is your opportunity to conduct corpus-based research pertaining to your own research interests. A non-comprehensive list of types of projects you might choose:
Grades will be assigned according to the following scheme:
10% Participation
50% Lab assignments
40% Research project
There is no curve.
Part I: Creating and using corpora
Date | Topic | Reading, things due
8.20 | |
8.22 | | M&W ch. 1; [1]
8.27 | | M&W ch. 2; [2]
8.29 | | [3]
9.3 | Labor Day (No class) |
9.5 | | Lab 1 due
9.10 | | M&W ch. 3
9.12 | | [4]; Lab 2 due
Part II: Applications
9.17 | | M&W ch. 4
9.19 | | Lab 3 due
9.24 | |
9.26 | | Lab 4 due
10.1 | | [5]
10.3 | | Lab 5 due
10.8 | | [6]
10.10 | | Lab 6 due
10.15 | | [7]
10.17 | | Lab 7 due
10.22 | | [8]
10.24 | | Lab 8 due
10.29 | | [9]
10.31 | | Lab 9 due
Part III: Computational techniques
11.5 | | [10]
11.7 | | [11]; Lab 10 due
11.12 | Veterans Day (No class) |
11.14 | | [12]
11.19 | | [13]
11.21 | | [14]
11.26 | | M&W ch. 5.1-5.3; [15]
11.28 | | Labs 11-14 due
12.3 | Lab 16: Your research project |
12.5 | | Lab 15 due
12.7 | | Research project presentations
12.10 | | Research project due
Readings and resources:
Tony McEnery and Andrew Wilson. 2001. Corpus Linguistics, 2nd ed., Edinburgh UP.
Cost from the bookstore (or Amazon) is $29 new, $22.05 used.
Supplementary material: http://bowland-files.lancs.ac.uk/monkey/ihe/linguistics/contents.htm
[1] Corpus methods from Anatol Stefanowitsch http://www-user.uni-bremen.de/~anatol/docs/corp_methods.pdf
[2] Baker, J. P. and Hardie, A. and McEnery, A. M. and Xiao, R. Z. and Bontcheva, K. and Cunningham, H. and Gaizauskas, R. and Hamza, O. and Maynard, D. and Tablan, V. and Ursu, C. (2004) Languages: Corpus Creation and Tool Development. Literary and Linguistic Computing, 19 (4). pp. 509-524. http://eprints.lancs.ac.uk/55/
[3] Regular expressions from Anatol Stefanowitsch http://www-user.uni-bremen.de/~anatol/docs/corp_regex.pdf
[4] Statistical tests from Anatol Stefanowitsch: http://www-user.uni-bremen.de/~anatol/qnt/qnt_dist.html
[5] Arnold, J. E., Wasow, T., Losongco, T., & Ginstrom, R. (2000). Heaviness vs. Newness: the effects of structural complexity and discourse status on constituent ordering. Language. 76(1), 28-55. http://www.unc.edu/~jarnold/papers/Arnold,Wasow,Losongco,Ginstrom2000.pdf
[6] Stefanowitsch, Anatol, and Stefan Th. Gries. (2003). Collostructions: investigating the interaction of words and constructions. International Journal of Corpus Linguistics 8.2: 209-243. http://www-user.uni-bremen.de/~anatol/docs/ms_collostructions.pdf
[7] FrameNet - read 'The Book' http://framenet.icsi.berkeley.edu/
[8] The CHILDES System, by Brian MacWhinney: http://childes.psy.cmu.edu/intro/childes.pdf
[9] Gahl, S., Jurafsky, D., & Roland, D. 2004. Verb subcategorization frequencies: American English corpus data, methodological studies, and cross-corpus comparisons. Behavior Research Methods, Instruments, & Computers, 36, 432-443. http://www2.hawaii.edu/~bergen/corpus/gahl.jurafsky.roland.frq.04.pdf
[10] Python - get it: http://wiki.python.org/moin/BeginnersGuide/Download and learn how to use it: http://www.hetland.org/python/instant-hacking.php
[11] Basic Python programming: http://www.zacharski.org/python2/Python2.pdf
[12] More Python programming: http://www.zacharski.org/python2/Python3.pdf
[13] Python practice: http://www.zacharski.org/python2/Python4.pdf
[14] Using Python for corpus work (Penn Treebank) http://freshmeat.net/articles/view/1617/
[15] Unicode: http://www.unicode.org/standard/WhatIsUnicode.html and http://www.unicode.org/standard/principles.html
1. Stanford link, esp. part-of-speech taggers http://www-nlp.stanford.edu/links/statnlp.html
2. CHILDES http://childes.psy.cmu.edu/intro/childes.pdf
3. CLAN http://childes.psy.cmu.edu/manuals/CLAN.pdf
4. English Lexicon Project http://elexicon.wustl.edu/default.asp
5. Do-It-Yourself Corpus Linguistics http://www.geocities.com/SoHo/Square/3472/program.html
6. Computational resources for linguistics research http://billposer.org/Linguistics/Computation/index.html
7. Creating language data resources from the LDC http://www.ldc.upenn.edu/Creating/
8. The corpora newsgroup http://torvald.aksis.uib.no/corpora/
9. Word lists: http://torvald.aksis.uib.no/corpora/wordlists.html
1. Rob Malouf's class on Computational Corpus Linguistics: http://www-rohan.sdsu.edu/~malouf/ling571.html#lectures
2. Emily Bender's courses on corpus linguistics and links: http://faculty.washington.edu/ebender/
1. Python homepage: http://www.python.org/
2. The re (regular expressions) module in Python http://www.amk.ca/python/howto/regex/
3. Python for linguists: http://www.zacharski.org/python/