Ling 431/631: Corpus Linguistics
Ben Bergen
Meeting 8: Syntax (2)
October 8, 2007
Today
- Term projects
- Collostructions
- Mid-term evaluation
Term projects
Term projects are scheduled to be presented on Dec 7th and submitted on Dec 10th.
Appropriate topics
- building a new, useful corpus of some language, language variety, register, etc.
- investigating a question of theoretical interest (syntax, semantics, morphology, phonology, sociolinguistics, acquisition, discourse, etc.) using an existing corpus
- norming stimuli (for experimentation, elicitation, etc.) using a corpus
- using a corpus for an applied linguistic purpose, such as preparation of language instruction materials or lexicography
The term paper
- (Since you're going to ask anyway:) It will probably be 10-20 pages. Much less or much more and I'll be suspicious.
- It can take the form of a research paper or a project description. In either case, you should include documentation of what you did and a clear description of what it's good for.
You will also need to submit a term project proposal
- 1-3 pages
- When should it be due? How about Nov. 1 (after the main linguistic content in the class)? Is this early enough? Too early?
- Meet with me this week or next to talk about your project. My office hours: M 2-3, T 10-11.
Questions
- You can work on a project with someone else, but each of you will be required to put in as much work as you would on a singly-authored project. And it will have to be twice as good.
- You can turn in your proposal early, but you're unlikely to get feedback before having to resubmit it unless you give it to me really soon.
Collostructions
There are significant associations among words that co-occur in a language: collocations. (We saw statistical means of measuring these.) Are there also significant associations between words and grammatical constructions?
Constructions: Meaning-bearing linguistic structures of any level of abstractness, whose meaning is not compositionally derived.
- words (barbarian, mussel)
- morphemes (-s, -ish)
- fully-fixed expressions (tit for tat, Bob's your uncle)
- variable idioms (be given to X, hoisted by X's own petard)
- partially filled constructions (X Y's way Z, the Xer the Yer)
- abstract grammatical constructions (ditransitive, transitive)
Collostructional analysis "starts with a particular construction and investigates which lexemes are strongly attracted or repelled by a particular slot in the construction (i.e. occur more frequently or less frequently than expected)" (p. 6)
- collexemes: lexemes attracted to a particular slot in a construction
- collostruct: construction associated with a particular lexeme
Methodology:
- Start from a construction: define it, including its slots, and search for n instances in the corpus
- Pare these down (manually, in part) to include only those that are instances of the target construction
- Calculate the independent frequencies of (1) the collexeme and (2) the collostruct, (3) their joint frequency, and (4) the total corpus size minus all instances of either. Together these determine the four cells of a 2x2 contingency table.
- Perform a Fisher exact test on these numbers, yielding a measure of the association between the collexeme and collostruct
- Do this for all potential collexemes, and order them by strength of association.
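The steps above can be sketched in a few lines of Python. This is only an illustration, not the procedure from the reading: the function names, the toy counts, and the choice of a two-sided test are all assumptions made here, and "corpus size" is taken to be whatever unit you counted the construction in. The Fisher exact p-value is computed directly from the hypergeometric distribution, so nothing beyond the standard library is needed.

```python
from math import comb


def contingency_table(joint, lexeme_total, construction_total, corpus_size):
    """Derive the four cells of the 2x2 table from the four counts."""
    a = joint                        # lexeme inside the construction
    b = lexeme_total - joint         # lexeme everywhere else
    c = construction_total - joint   # construction with other lexemes
    d = corpus_size - a - b - c      # neither the lexeme nor the construction
    if min(a, b, c, d) < 0:
        raise ValueError("counts are inconsistent")
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value: sum the probabilities of all tables
    with the same margins that are no more likely than the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lowest = max(0, col1 - (n - row1))
    highest = min(row1, col1)
    p_observed = prob(a)
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= p_observed * (1 + 1e-7))  # small tolerance for ties
    return min(1.0, total)


def collostruction(lexeme, joint, lexeme_total, construction_total, corpus_size):
    """Association of one lexeme with the construction slot."""
    a, b, c, d = contingency_table(joint, lexeme_total,
                                   construction_total, corpus_size)
    expected = lexeme_total * construction_total / corpus_size
    if joint > expected:
        direction = "attracted"
    elif joint < expected:
        direction = "repelled"
    else:
        direction = "neutral"
    return {"lexeme": lexeme, "expected": expected,
            "direction": direction, "p": fisher_exact_two_sided(a, b, c, d)}


def rank_collexemes(counts, construction_total, corpus_size):
    """Run the test for every candidate and order by strength (smallest p first)."""
    results = [collostruction(lex, joint, total, construction_total, corpus_size)
               for lex, (joint, total) in counts.items()]
    return sorted(results, key=lambda r: r["p"])


# Invented counts, for illustration only: lexeme -> (joint, lexeme_total).
toy_counts = {"verb_a": (30, 40), "verb_b": (5, 200), "verb_c": (12, 100)}
for row in rank_collexemes(toy_counts, construction_total=100, corpus_size=1000):
    print(row["lexeme"], row["direction"], row["expected"], row["p"])
```

Comparing the joint frequency with the expected frequency gives the direction of the association; the p-value alone does not say whether a lexeme is attracted or repelled.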
Questions
- In a collostructional analysis, you have to separate attracted and repelled collexemes. How and why do you do this?
- Where do you draw the line for significant collexemes?
- Footnote 3 says: "For the moment, we will only consider as repelled items those which do occur, but occur less frequently than expected, although it would of course also be possible to include items that should have occurred on statistical grounds, but did not." Explain this.
- Do you really go through all the instances of the given construction by hand?
- Do you have to run a separate Fisher exact test for each collexeme? Isn't there a better way?