Corpus Linguistics

Linguistics 431 & 631

Ben Bergen

Mon, Wed 3:30-4:45

 

 

Course summary

 

This course is an introduction to the use of corpora in the study of language. A corpus is a large collection of language data that can be the empirical basis for a broad range of applications, like:

 

•      Describing languages (whether well documented or under-documented)

•      Testing predictions made by linguistic theories (syntactic, morphological, phonetic, etc.)

•      Balancing stimuli for use in experiments

 

But interacting fruitfully with a corpus requires certain expertise. This course addresses the following:

 

•      What corpus resources exist, how many we have access to at UH (answer: a lot!), and where they can be found and used.

•      What different types of corpus exist, including full acoustic records, part-of-speech tagged corpora, syntactically parsed corpora, sociolinguistically tagged corpora, etc.

•      If there isn't a corpus for a language (or a question), you need to build one. But what can go into a corpus, and how do you make sure it will let you do what you want with it?

•      Different tools for searching through corpora allow you to investigate different sorts of questions, from the frequency of individual words or morphemes to the range of noun phrases that can be the subjects of a particular class of verbs. Which search tools allow you to investigate which sorts of questions, and how do you use those tools?

•      Once you have searched through a corpus, you need to apply tests to tell whether what you found is statistically significant. Which tests are appropriate when, and how do you perform them?
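
As a taste of the last two points, here is a minimal Python sketch of a word-frequency search followed by a significance test. Everything in it is an illustrative assumption rather than course material: the two toy "corpora" are invented sentences, the names word_frequencies and chi_square_2x2 are hypothetical, and the Pearson chi-square statistic is just one of several tests that could be applied.

```python
import re
from collections import Counter

# Toy "corpora" invented for illustration; real corpora would be read from files.
CORPUS_A = "the cat sat on the mat and the dog sat too"
CORPUS_B = "a cat ran and a dog ran and a bird sang"


def word_frequencies(text):
    """Count lowercase word tokens, using a simple regular expression as tokenizer."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column total must be nonzero")
    return n * (a * d - b * c) ** 2 / denominator


freq_a = word_frequencies(CORPUS_A)
freq_b = word_frequencies(CORPUS_B)

# Contingency table: occurrences of "the" vs. all other tokens, in each corpus.
the_a, the_b = freq_a["the"], freq_b["the"]
other_a = sum(freq_a.values()) - the_a
other_b = sum(freq_b.values()) - the_b

statistic = chi_square_2x2(the_a, other_a, the_b, other_b)
print(freq_a.most_common(3))
print(round(statistic, 3))
```

With one degree of freedom, a chi-square value above roughly 3.84 corresponds to p < .05. Counts this small are far too low for the test to be trustworthy, so the sketch only shows the mechanics.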

 

Approach

 

The approach is extremely hands-on. The typical week will include a seminar-style meeting on Monday and an applied lab meeting on Wednesday. Students will learn both the theory behind the construction and use of corpora and the applied details of how to use existing corpora and corpus search tools, with special emphasis on those we have available in the LAE labs, which include English, Chinese, Japanese, German, and Korean corpora.

 

Who this course is for

 

This course is intended for students with little computer knowledge. It is open to advanced undergraduate students and graduate students with an interest in language, in any discipline. No programming experience is required.


Assignments and Evaluation

 

Students have four responsibilities:

 

Class participation entails not mere physical presence and production of sound, but engaged, constructive, critical, informed, and concise contributions to class discussion. This can be done orally, in class, or by email, if you're less comfortable speaking up. All course meetings are mandatory. I understand that things come up, so in the absence of catastrophic events, up to three absences are excusable (though students are still responsible for the content of missed meetings). Additional absences will decrease the participation grade.

 

Each lab meeting will be structured around a lab assignment. The aim is for these to be completable during scheduled class time by the fastest students, but they will likely often take longer for more slow-and-steady workers. I will of course be available during the lab to answer any questions you have, and you are also encouraged to work collaboratively: while each student should execute all the required operations himself or herself, I suggest you discuss particularly useful solutions, strategies, etc. with your neighbors. These lab assignments will be collected one week after they're assigned, and together will form a sort of personalized instruction manual for corpus linguistics.

 

Coursework will culminate in a research project. This is your opportunity to conduct corpus-based research pertaining to your own research interests. A non-comprehensive list of types of projects you might choose:

 

 

Grades will be assigned according to the following scheme:

 

10%     Participation

50%     Lab assignments

40%     Research project

 

There is no curve.

 

 

Access

 

I will hold weekly office hours, at times to be determined. My office is 581 Moore Hall. You can also contact me by email, at bergen@hawaii.edu.

 

Lecture notes, an up-to-date course schedule, links to online versions of course readings, and links to relevant resources will appear through the semester at http://www2.hawaii.edu/~bergen/corpus/


Schedule (provisional, and subject to revision)

 

Part I: Creating and using corpora

Date     Topic                                             Reading, things due
8.20     Introduction to corpus linguistics
8.22     Lab 1: Corpus test drive                          M&W ch. 1; [1]
8.27     Design considerations and corpus types            M&W ch. 2; [2]
8.29     Regular expression searches                       [3]
9.3      Labor Day (No class)
9.5      Lab 2: Regular expressions ... [corpus]           Lab 1 due
9.10     Quantitative methods                              M&W ch. 3
9.12     Lab 3: Quantitative methods                       [4]; Lab 2 due

 

 

 

Part II: Applications

9.17     Part-of-speech tags                               M&W ch. 4
9.19     Lab 4: Part-of-speech tags                        Lab 3 due
9.24     Morphology
9.26     Lab 5: Morphology                                 Lab 4 due
10.1     Syntax 1                                          [5]
10.3     Lab 6: Syntax 1                                   Lab 5 due
10.8     Syntax 2                                          [6]
10.10    Lab 7: Syntax 2                                   Lab 6 due
10.15    FrameNet                                          [7]
10.17    Lab 8: FrameNet                                   Lab 7 due
10.22    Language acquisition                              [8]
10.24    Lab 9: CHILDES                                    Lab 8 due
10.29    Norming experimental stimuli                      [9]
10.31    Lab 10: Norming                                   Lab 9 due

 

 

 

Part III: Computational techniques

11.5     The most useful thing you will ever learn, ever   [10]
11.7     Lab 11: Python 1                                  [11]; Lab 10 due
11.12    Veterans Day (No class)
11.14    Lab 12: Python 2                                  [12]
11.19    Lab 13: Python 3                                  [13]
11.21    Lab 14: Python 4                                  [14]
11.26    Encoding and tagging                              M&W ch. 5.1-5.3; [15]
11.28    Lab 15: Concordancing                             Labs 11-14 due
12.3     Lab 16: Your research project
12.5     Last day of class                                 Lab 15 due
12.7                                                       Research project presentations
12.10                                                      Research project due

 


Readings and resources:

 

The textbook

Tony McEnery and Andrew Wilson. 2001. Corpus Linguistics, 2nd ed., Edinburgh UP. Cost from the bookstore (or Amazon) is $29 new, $22.05 used.

Supplementary material: http://bowland-files.lancs.ac.uk/monkey/ihe/linguistics/contents.htm

 

Online readings

[1]  Corpus methods from Anatol Stefanowitsch http://www-user.uni-bremen.de/~anatol/docs/corp_methods.pdf

[2]  Baker, J. P., Hardie, A., McEnery, A. M., Xiao, R. Z., Bontcheva, K., Cunningham, H., Gaizauskas, R., Hamza, O., Maynard, D., Tablan, V., & Ursu, C. (2004). Languages: Corpus Creation and Tool Development. Literary and Linguistic Computing, 19(4), 509-524. http://eprints.lancs.ac.uk/55/

[3]  Regular expressions from Anatol Stefanowitsch http://www-user.uni-bremen.de/~anatol/docs/corp_regex.pdf

[4]  Statistical tests from Anatol Stefanowitsch: http://www-user.uni-bremen.de/~anatol/qnt/qnt_dist.html

[5]  Arnold, J. E., Wasow, T., Losongco, T., & Ginstrom, R. (2000). Heaviness vs. Newness: the effects of structural complexity and discourse status on constituent ordering. Language. 76(1), 28-55. http://www.unc.edu/~jarnold/papers/Arnold,Wasow,Losongco,Ginstrom2000.pdf

[6]  Stefanowitsch, Anatol, and Stefan Th. Gries. (2003). Collostructions: investigating the interaction of words and constructions. International Journal of Corpus Linguistics 8.2: 209-243. http://www-user.uni-bremen.de/~anatol/docs/ms_collostructions.pdf

[7]  FrameNet - read 'The Book' http://framenet.icsi.berkeley.edu/

[8]  The CHILDES System, by Brian MacWhinney: http://childes.psy.cmu.edu/intro/childes.pdf

[9]  Gahl, S., Jurafsky, D., & Roland, D. 2004. Verb subcategorization frequencies: American English corpus data, methodological studies, and cross-corpus comparisons. Behavior Research Methods, Instruments, & Computers, 36, 432-443. http://www2.hawaii.edu/~bergen/corpus/gahl.jurafsky.roland.frq.04.pdf

[10]      Python - get it: http://wiki.python.org/moin/BeginnersGuide/Download and learn how to use it: http://www.hetland.org/python/instant-hacking.php

[11]      Basic Python programming: http://www.zacharski.org/python2/Python2.pdf

[12]      More Python programming: http://www.zacharski.org/python2/Python3.pdf

[13]      Python practice: http://www.zacharski.org/python2/Python4.pdf

[14]      Using Python for corpus work (Penn Treebank) http://freshmeat.net/articles/view/1617/

[15]      Unicode: http://www.unicode.org/standard/WhatIsUnicode.html and http://www.unicode.org/standard/principles.html

 

 

General corpus linguistics resources

1.     Stanford links, esp. part-of-speech taggers http://www-nlp.stanford.edu/links/statnlp.html

2.     CHILDES http://childes.psy.cmu.edu/intro/childes.pdf

3.     CLAN http://childes.psy.cmu.edu/manuals/CLAN.pdf

4.     English Lexicon Project http://elexicon.wustl.edu/default.asp

5.     Do-It-Yourself Corpus Linguistics http://www.geocities.com/SoHo/Square/3472/program.html

6.     Computational resources for linguistics research  http://billposer.org/Linguistics/Computation/index.html

7.     Creating language data resources from the LDC http://www.ldc.upenn.edu/Creating/

8.     The corpora newsgroup http://torvald.aksis.uib.no/corpora/

9.     Word lists: http://torvald.aksis.uib.no/corpora/wordlists.html

 

Other corpus linguistics classes

1.     Rob Malouf's class on Computational Corpus Linguistics: http://www-rohan.sdsu.edu/~malouf/ling571.html#lectures

2.     Emily Bender's courses on corpus linguistics and links: http://faculty.washington.edu/ebender/

 

Python resources

1.     Python homepage: http://www.python.org/

2.     Re (regular expressions) module in Python http://www.amk.ca/python/howto/regex/

3.     Python for linguists: http://www.zacharski.org/python/