Linguistics 431/631: Connectionist language modeling
Ben Bergen
November 14, 2006
Syntax
We've seen how simple grammatical class groupings can be learned through abstraction.
But there's more to syntax than just this.
• Syntax involves structural relations
  o Structural units can have internal structure (e.g. NPs)
  o Some relations are context dependent, like argument structure and thematic roles.
• Syntax is productive: new word sequences are formed all the time (although less often than many people think)
We've seen how a distributed system can model grammatical category acquisition, but how does it deal with the other two properties, structural relations and productivity?
Localist syntax
One way to solve these issues is to use a localist syntax
• Assign a single node or set of nodes to each grammatical category (like agent, modifier, etc.)
• Also assign nodes to grammatical relations (like daughter of, subject, etc.)
• Bind together the categories and relations through co-activation
  o Binding can be effected through nodes that detect co-activation of two nodes
  o Or by having nodes fire together in phase
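The first binding option can be sketched in a few lines. This is only an illustration under stated assumptions: the node inventory reuses the example labels above, and the function name binder_nodes, the unit weights and the threshold of 2.0 are all hypothetical choices, not part of any particular published model.

```python
from itertools import product

# Illustrative inventory, reusing the example labels from these notes.
CATEGORIES = ["agent", "modifier"]
RELATIONS = ["subject", "daughter_of"]

def binder_nodes(active, threshold=2.0):
    """Return the (category, relation) binder nodes that fire.

    Each binder node gets weight-1 input from one category node and one
    relation node, so it reaches threshold only when both are active.
    (Weights and threshold are assumptions made for this sketch.)
    """
    fired = []
    for category, relation in product(CATEGORIES, RELATIONS):
        net_input = active.get(category, 0.0) + active.get(relation, 0.0)
        if net_input >= threshold:
            fired.append((category, relation))
    return fired

one_pair = binder_nodes({"agent": 1.0, "subject": 1.0})
everything = binder_nodes(
    {"agent": 1.0, "subject": 1.0, "modifier": 1.0, "daughter_of": 1.0}
)
print(one_pair)         # [('agent', 'subject')]
print(len(everything))  # 4
```

With one category and one relation active, exactly one binder fires. With two of each active at the same moment, all four binders fire, so co-activation alone no longer says which category goes with which relation.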
Problems with this approach
• How do you represent sentences that have multiple instances of a particular grammatical category or relation?
  o You could multiply these nodes, but then the localist perspective would imply hard constraints on the number of such roles and relations in a given utterance.
  o The problem with this is that these constraints seem to be soft: relative clause embedding, for example, works better with particular structures than with others
    ▪ The cat the dog the mouse saw chased ran away.
    ▪ The planet the astronomer the university hired saw exploded.
Distributed syntax
The basic idea
• These are models of language use, where the system learns to perform some syntactic behavior. This is seen as advantageous since
  o It does not assume that any particular linguistic analysis is right.
  o It does not assume that the "correct" categories and relations are available in the real world for the learner to latch on to.
• They automatically construct their own categorizations and uses for grammatical categories and relations
Elman (1991) presents one such model
• Like the Elman (1990) model, it used a recurrent network architecture to predict the next word on the basis of the current word (and context).
• The aim was to see whether a network like this could automatically identify and learn context-sensitive grammatical relations
The architecture is just like that of Elman (1990), except for the number of nodes in each layer and the presence of additional hidden layers between the input and the recurrent hidden layer, and between the recurrent hidden layer and the output.
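To make the layer arrangement concrete, here is a minimal sketch of one forward pass through a network of this shape. Everything specific in it is an assumption for illustration: the class name SimpleRecurrentNet, the layer sizes, the weight range, logistic units without bias terms, and the neutral starting context. It shows only prediction, not learning.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Small random weights; the range is an assumption of this sketch."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def layer(weights, inputs):
    """Logistic units: each row of weights drives one unit (no biases)."""
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, inputs))))
            for row in weights]

class SimpleRecurrentNet:
    def __init__(self, n_in, n_pre, n_hid, n_post, n_out, seed=0):
        rng = random.Random(seed)
        self.w_pre = make_matrix(n_pre, n_in, rng)           # input -> extra layer
        self.w_hid = make_matrix(n_hid, n_pre + n_hid, rng)  # extra layer + context -> hidden
        self.w_post = make_matrix(n_post, n_hid, rng)        # hidden -> extra layer
        self.w_out = make_matrix(n_out, n_post, rng)         # extra layer -> output
        self.context = [0.5] * n_hid                         # neutral starting context

    def step(self, word_vector):
        """Consume one word vector and return activations over possible next words."""
        pre = layer(self.w_pre, word_vector)
        hidden = layer(self.w_hid, pre + self.context)
        self.context = hidden[:]   # copy hidden state; it is the context for the next word
        post = layer(self.w_post, hidden)
        return layer(self.w_out, post)

# Hypothetical sizes: 26 input and output bits, small inner layers.
net = SimpleRecurrentNet(n_in=26, n_pre=4, n_hid=8, n_post=4, n_out=26)
word = [0] * 26
word[3] = 1
first = net.step(word)
second = net.step(word)   # same word, different context
```

Feeding the same word twice gives two different output vectors, because the second pass sees the hidden state left behind by the first. That copied-back context is what lets the prediction depend on more than the current word.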
Stimuli
• Composed of sentences
• Built from a set of 23 words
• Each word is encoded as a 26-bit vector
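One simple way to picture such an encoding is one bit per word. The sketch below assumes exactly that; the notes do not say how the bits were assigned, and the short word list here is only an illustrative subset built from words that appear in these notes, not the actual 23-word lexicon.

```python
VECTOR_BITS = 26

# Illustrative subset only; the real lexicon had 23 words.
LEXICON = ["John", "Mary", "boys", "cat", "who",
           "sees", "chases", "feed", "walks", "."]

def encode(word):
    """Return a 26-bit vector with a single bit set for this word (assumed scheme)."""
    vector = [0] * VECTOR_BITS
    vector[LEXICON.index(word)] = 1
    return vector

def decode(vector):
    """Recover the word from a one-bit-per-word vector."""
    return LEXICON[vector.index(1)]

sentence = ["John", "sees", "cat", "."]
encoded = [encode(word) for word in sentence]
```

Under this assumption no two word vectors overlap, so the input itself carries no information about which words are similar. Any grouping of words has to be discovered by the network.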
Some properties that were in evidence in the stimuli
Subject-verb agreement
• In all main clauses (John sees the cat.)
• In relative clauses, where relevant (John sees cat who walks.)
Verb argument structure
• Some intransitive verbs (walk, live)
• Some transitive verbs (hit, feed)
• Some optionally transitive verbs (see, hear)
Recursion
• Relative clauses can be embedded inside NPs, and can themselves contain NPs
Linguists usually agree that accounting for these phenomena requires linguistic structure beyond simple word order.
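A toy generator shows how these three properties can coexist in one small grammar. It is a sketch under stated assumptions, not the grammar that produced the actual stimuli: the noun list, the 50% probabilities, the depth limit, and the restriction to subject relative clauses ("boy who walks") are all choices made here for brevity.

```python
import random

# Word lists are illustrative; verbs follow the three classes listed above.
NOUNS = {"sg": ["boy", "cat"], "pl": ["boys", "cats"]}
VERBS = {
    "intransitive": {"sg": ["walks", "lives"], "pl": ["walk", "live"]},
    "transitive": {"sg": ["hits", "feeds"], "pl": ["hit", "feed"]},
    "optional": {"sg": ["sees", "hears"], "pl": ["see", "hear"]},
}

def noun_phrase(rng, depth):
    """Return (words, number). The NP may contain a relative clause, whose
    verb phrase may contain further NPs: this is the recursion."""
    number = rng.choice(["sg", "pl"])
    words = [rng.choice(NOUNS[number])]
    if depth > 0 and rng.random() < 0.5:
        words += ["who"] + verb_phrase(rng, number, depth - 1)
    return words, number

def verb_phrase(rng, number, depth):
    """The verb agrees with `number`; its class decides whether an object follows."""
    verb_class = rng.choice(list(VERBS))
    words = [rng.choice(VERBS[verb_class][number])]
    takes_object = verb_class == "transitive" or (
        verb_class == "optional" and rng.random() < 0.5)
    if takes_object:
        object_words, _ = noun_phrase(rng, depth)
        words += object_words
    return words

def sentence(rng, depth=2):
    """Main clause: the verb agrees with the head noun of the subject NP."""
    subject, number = noun_phrase(rng, depth)
    return subject + verb_phrase(rng, number, depth) + ["."]

rng = random.Random(0)
for _ in range(5):
    print(" ".join(sentence(rng)))
```

When the subject contains a relative clause, the main verb must still agree with the head noun, however many words intervene. A learner that tracked only the immediately preceding word could not get this right.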
Training
There were four phases of training, of increasing complexity (each with 10,000 sentences and 5 passes):
1. Only simple sentences
2. 25% complex and 75% simple
3. 50% complex and 50% simple
4. 75% complex and 25% simple
An interesting note about starting small: Elman used this phased presentation strategy because he had previously found that the network would not learn when presented with all the data at once.
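The phased schedule can be written down as a short loop. This is a hedged sketch: the proportions, corpus size and number of passes come from the list above, but the function names, the callback arguments (train_on, make_simple, make_complex), the use of exact rather than sampled proportions, and the shuffling are assumptions of this illustration.

```python
import random

COMPLEX_SHARE_BY_PHASE = [0.0, 0.25, 0.50, 0.75]  # from the four phases above
SENTENCES_PER_PHASE = 10_000
PASSES_PER_PHASE = 5

def build_corpus(complex_share, make_simple, make_complex, rng,
                 size=SENTENCES_PER_PHASE):
    """Mix complex and simple sentences in the requested proportion."""
    n_complex = round(size * complex_share)
    corpus = [make_complex(rng) for _ in range(n_complex)]
    corpus += [make_simple(rng) for _ in range(size - n_complex)]
    rng.shuffle(corpus)  # shuffling is an assumption of this sketch
    return corpus

def run_curriculum(train_on, make_simple, make_complex, seed=0):
    """Call train_on(phase, corpus) once per pass, phase by phase."""
    rng = random.Random(seed)
    for phase, share in enumerate(COMPLEX_SHARE_BY_PHASE, start=1):
        corpus = build_corpus(share, make_simple, make_complex, rng)
        for _ in range(PASSES_PER_PHASE):
            train_on(phase, corpus)

# Stand-in callbacks so the schedule can be inspected without a real network.
log = []
run_curriculum(
    train_on=lambda phase, corpus: log.append((phase, corpus.count("complex"))),
    make_simple=lambda rng: "simple",
    make_complex=lambda rng: "complex",
)
print(len(log))  # 20 training passes in total
```

The same weights carry over from one phase to the next, so by the time complex sentences dominate, the network has already been shaped by the simple ones.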
Results
The network learned well, with an error of 0.17 (computed against each word's probability of occurring at a given point, not against the single word that actually occurred).
• Verbs agree with their subjects in simple sentences
• The network predicts appropriate continuations to subject-verb sequences
• It correctly predicts the agreement of verbs in relative clauses, like Boys who Mary chases feed cat
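Why score against probabilities rather than against the actual next word? A small sketch makes the point. The two-sentence corpus, the use of whole sentence prefixes as contexts, and the sum-squared error measure are assumptions chosen for illustration; they are not claimed to match the original evaluation.

```python
from collections import Counter, defaultdict

def next_word_probabilities(corpus, vocab):
    """Empirical P(next word | preceding words), one vector per context.

    Contexts are whole sentence prefixes, an assumption made for simplicity.
    """
    counts = defaultdict(Counter)
    for words in corpus:
        for i in range(1, len(words)):
            counts[tuple(words[:i])][words[i]] += 1
    return {context: [c[w] / sum(c.values()) for w in vocab]
            for context, c in counts.items()}

def sum_squared_error(output, target):
    return sum((o - t) ** 2 for o, t in zip(output, target))

vocab = ["walks", "sees"]
corpus = [["boy", "walks"], ["boy", "sees"]]
probabilities = next_word_probabilities(corpus, vocab)

output = [0.5, 0.5]   # a hypothetical network output after "boy"
actual = [1, 0]       # the word that happened to occur: "walks"

print(sum_squared_error(output, probabilities[("boy",)]))  # 0.0
print(sum_squared_error(output, actual))                   # 0.5
```

After "boy", both verbs are equally likely in this toy corpus, so an output of 0.5 for each is the best possible prediction. Scoring it against the word that happened to occur would penalize the network for uncertainty that is really in the data.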



Acquiring anaphora
Frank et al. (ms) trained an SRN much like Elman's to predict whether object pronouns (him, her) or reflexive anaphoric pronouns (himself, herself) would appear in certain sentence positions.

Anaphoric pronouns are tricky because their interpretation depends on structure, not just linear word order
• John told Tom to kiss him.
• John told Tom to kiss himself.
• John, who likes Tom, kissed himself.
• John, who likes Tom, kissed him.
The network got sentences like these and had to predict which word would come next (like Elman 1990, 1991).
In general, it did a good job of predicting the next word, but hierarchical clustering analysis showed that it did so using internal representations different from those humans are thought to use.
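For readers unfamiliar with hierarchical clustering analysis, here is a minimal sketch of the method itself. The hidden-state vectors are invented for illustration and are not taken from Frank et al.; the choice of Euclidean distance and average linkage is also an assumption.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(cluster_a, cluster_b, vectors):
    """Mean distance between all cross-cluster pairs of items."""
    total = sum(distance(vectors[i], vectors[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def cluster(vectors):
    """Agglomerative clustering; returns the merges in the order they happen."""
    clusters = [(label,) for label in vectors]
    merges = []
    while len(clusters) > 1:
        pairs = [(average_linkage(a, b, vectors), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        _, i, j = min(pairs)   # closest pair of clusters
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Made-up 2-unit hidden states, for illustration only.
hidden_states = {
    "him": [0.9, 0.1],
    "her": [0.8, 0.2],
    "himself": [0.1, 0.9],
    "herself": [0.3, 0.7],
}
for left, right in cluster(hidden_states):
    print(left, "+", right)
```

The order of merges is the tree: items that merge early have similar hidden representations. Reading such a tree is how one asks whether the network groups items along the lines a linguistic analysis would predict, or along some other lines.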
So what does this mean?