Linguistics 431/631: Connectionist language modeling

Ben Bergen

 

Meeting 12: Syntax 2

November 14, 2006

 

Syntax

 

We've seen how simple grammatical class groupings can be learned through abstraction.

 

But there's more to syntax than just this.

• Syntax involves structural relations

o Structural units can have internal structure (e.g., NPs)

o Some relations are context-dependent, like argument structure and thematic roles.

• Syntax is productive: new word sequences are formed all the time (although less often than many people think)

 

We've seen how a distributed system can model grammatical category acquisition, but how does it handle the other two properties?

 

Localist syntax

 

One way to address these issues is to use a localist syntax:

• Assign a single node or set of nodes to each grammatical category (like agent, modifier, etc.)

• Also assign nodes to grammatical relations (like daughter-of, subject, etc.)

• Bind together the categories and relations through co-activation (see the sketch after this list)

o Binding can be effected through nodes that detect co-activation of two nodes

o Or by having nodes fire together in phase
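As a toy illustration (not taken from any specific model in the readings), binding by co-activation detection can be pictured as a conjunctive unit that fires only when a role node and a filler node are active at the same time:

```python
# Toy sketch of localist binding by co-activation (illustrative only).
# A "binding node" detects that a role node (e.g., SUBJECT) and a filler node
# (e.g., JOHN) are active together; the threshold value is an assumption.

def binding_node(role_activation: float, filler_activation: float,
                 threshold: float = 0.5) -> float:
    """Fire (1.0) only if both the role and the filler units exceed the threshold."""
    if role_activation > threshold and filler_activation > threshold:
        return 1.0
    return 0.0

# Example: bind JOHN to SUBJECT only when both nodes are active.
print(binding_node(role_activation=0.9, filler_activation=0.8))  # 1.0 -> bound
print(binding_node(role_activation=0.9, filler_activation=0.2))  # 0.0 -> not bound
```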

 

Problems with this approach

• How do you represent sentences that have multiple instances of a particular grammatical category or relation?

o You could multiply these nodes, but then the localist perspective would imply hard constraints on the number of such roles and relations in a given utterance.

o The problem with this is that these constraints seem to be soft: relative clause embedding, for example, works better with some structures than with others:

- The cat the dog the mouse saw chased ran away.

- The planet the astronomer the university hired saw exploded.

 


Distributed syntax

 

The basic idea

• These are models of language use, in which the system learns to perform some syntactic behavior. This is seen as advantageous because:

o It does not assume that any particular linguistic analysis is right.

o It does not assume that the correct categories and relations are available in the real world for the learner to latch onto.

• These models automatically construct their own categorizations and uses for grammatical categories and relations.

 

Elman (1991) presents one such model

• Like the Elman (1990) model, it used a recurrent network architecture to predict the next word on the basis of the current word (and context).

• The aim was to see whether a network like this could automatically identify and learn context-sensitive grammatical relations.

 

The architecture is just like Elman (1990), except for the number of nodes in each layer and the presence of additional hidden layers between the input and hidden layers, and between the hidden and output layers.
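A minimal sketch of a simple recurrent network of this general shape (layer sizes are illustrative, and the extra hidden layers Elman added are omitted for brevity):

```python
import numpy as np

# Minimal simple-recurrent-network sketch (assumed sizes, not Elman's exact ones).
# The context layer holds a copy of the previous hidden state, which is fed back
# in alongside the current input word.

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 26, 70, 26          # illustrative layer sizes

W_ih = rng.normal(scale=0.1, size=(n_hid, n_in))    # input -> hidden weights
W_ch = rng.normal(scale=0.1, size=(n_hid, n_hid))   # context -> hidden weights
W_ho = rng.normal(scale=0.1, size=(n_out, n_hid))   # hidden -> output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(word_vec, context):
    """One time step: combine the current word with the saved context and
    produce output activations interpretable as a next-word prediction."""
    hidden = sigmoid(W_ih @ word_vec + W_ch @ context)
    output = sigmoid(W_ho @ hidden)
    return output, hidden                 # the hidden state becomes the next context

context = np.zeros(n_hid)
for i in range(3):                        # feed a toy three-word sequence
    word = np.zeros(n_in)
    word[i] = 1.0                         # fake localist word encoding
    prediction, context = step(word, context)
```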

 

Stimuli

• Composed of sentences

• Built from a lexicon of 23 words

• Each word is encoded as a 26-bit vector
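A sketch of what such a localist word encoding might look like (the word list below is a placeholder, not Elman's actual 23-word lexicon):

```python
import numpy as np

# Sketch: one bit per word in a 26-bit vector; remaining bits assumed unused.
# Placeholder 23-word lexicon, not the actual stimulus vocabulary.
lexicon = ["boy", "boys", "girl", "girls", "cat", "cats", "dog", "dogs",
           "chase", "chases", "feed", "feeds", "walk", "walks",
           "see", "sees", "hear", "hears", "live", "lives",
           "who", "Mary", "John"]

def encode(word: str) -> np.ndarray:
    """Return a 26-bit vector with a single bit set for the given word."""
    vec = np.zeros(26)
    vec[lexicon.index(word)] = 1.0
    return vec

print(encode("cat"))
```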

 

Some properties that were in evidence in the stimuli

 

Subject-verb agreement

• In all main clauses (John sees the cat.)

• In relative clauses, where relevant (John sees cat who walks.)

 

Verb argument structure

• Some intransitive verbs (walk, live)

• Some transitive verbs (hit, feed)

• Some optionally transitive verbs (see, hear)

 

Recursion

• Relative clauses can be embedded inside NPs, which they can also contain

 

Linguists generally agree that accounting for these phenomena requires linguistic structure beyond simple word order.
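As a toy illustration of the kind of grammar that could generate stimuli with these properties (an assumed mini-grammar, not Elman's actual one), here is a sketch showing subject-verb number agreement and relative clauses recursively embedded inside NPs:

```python
import random

# Toy stimulus-generator sketch (assumed mini-grammar, not Elman's 1991 grammar).
# It illustrates number agreement and relative-clause recursion inside NPs;
# verb argument structure is not modeled here.
random.seed(0)

NOUNS = {"sg": ["boy", "girl", "cat"], "pl": ["boys", "girls", "cats"]}
VERBS = {"sg": ["chases", "feeds", "sees"], "pl": ["chase", "feed", "see"]}

def noun_phrase(number, depth=0):
    """An NP is a noun, optionally followed by an object-relative clause
    whose verb agrees with the embedded subject NP."""
    noun = random.choice(NOUNS[number])
    if depth < 2 and random.random() < 0.3:
        rel_number = random.choice(["sg", "pl"])
        return (f"{noun} who {noun_phrase(rel_number, depth + 1)} "
                f"{random.choice(VERBS[rel_number])}")
    return noun

def sentence():
    """Subject NP + a verb agreeing with it + object NP."""
    number = random.choice(["sg", "pl"])
    obj_number = random.choice(["sg", "pl"])
    return f"{noun_phrase(number)} {random.choice(VERBS[number])} {noun_phrase(obj_number)}"

for _ in range(3):
    print(sentence())
```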

 


Training

 

There were four phases of training, of increasing complexity (in each phase: 10,000 sentences, 5 passes):

1.     Only simple sentences

2.     25% complex and 75% simple

3.     50% complex and 50% simple

4.     75% complex and 25% simple

 

An interesting note about "starting small": Elman used this phased presentation strategy because he had previously found that the network would not learn when presented with all of the data at once.
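A schematic sketch of that phased regimen (the corpus generator and training function are assumed placeholders, not a real API):

```python
import random

# Schematic "starting small" training loop, following the phase list above.
# make_corpus and train_one_pass are placeholders for illustration only.

def make_corpus(n_sentences, prop_complex):
    """Mix complex and simple sentences in the stated proportion (placeholder)."""
    return ["complex" if random.random() < prop_complex else "simple"
            for _ in range(n_sentences)]

def train_one_pass(network, corpus):
    """Placeholder for one pass of next-word-prediction training."""
    pass

network = {}                                # stand-in for the SRN's weights
phases = [0.00, 0.25, 0.50, 0.75]           # proportion of complex sentences per phase

for prop_complex in phases:
    corpus = make_corpus(10_000, prop_complex)
    for _ in range(5):                      # 5 passes through each phase's corpus
        train_one_pass(network, corpus)
```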

 

Results

 

The network learned well, with an error rate of 0.17 (measured against each word's probability of occurring at a given point, not against the single word that actually occurred; see the sketch after the list below).

• Verbs agree with their subjects in simple sentences

• The network predicts appropriate continuations to subject-verb sequences

• It correctly predicts the agreement of verbs in relative clauses, as in Boys who Mary chases feed cat
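One way to picture that evaluation measure (a sketch with made-up numbers, not Elman's exact procedure): compare the network's output vector to the empirical distribution of words that could grammatically occur next.

```python
import numpy as np

# Sketch: error measured against each word's empirical probability of occurring
# next, rather than against the single word that actually followed.
vocab_size = 26

# Assume that at some point only words 3 and 7 can grammatically follow,
# each with empirical probability 0.5.
empirical = np.zeros(vocab_size)
empirical[[3, 7]] = 0.5

network_output = np.full(vocab_size, 0.01)  # fake network activations
network_output[[3, 7]] = 0.45

error = np.mean((network_output - empirical) ** 2)  # mean squared error per output unit
print(round(float(error), 4))
```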


Acquiring anaphora

 

Frank et al. (ms) trained an SRN much like Elman's to predict whether object pronouns (him, her) or reflexive pronouns (himself, herself) would appear in certain sentence positions.

 


Anaphoric pronouns are tricky because their interpretation depends on structure, not just linear word order:

• John told Tom to kiss him.

• John told Tom to kiss himself.

• John, who likes Tom, kissed himself.

• John, who likes Tom, kissed him.

 

The network was given sentences like these and had to predict which word would come next (as in Elman 1990, 1991).

 

In general, it did a good job of predicting the next word, but hierarchical clustering analysis of its internal representations suggests that it did so using different representations than humans are thought to use.
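For reference, the kind of hierarchical clustering analysis meant here can be sketched as follows (random stand-in data, not the actual network's activations):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Sketch of hierarchical clustering of hidden-unit activations
# (fake data; labels and layer size are assumptions for illustration).
rng = np.random.default_rng(0)

words = ["him", "her", "himself", "herself", "John", "Tom"]  # assumed labels
hidden_vectors = rng.random((len(words), 70))                # fake hidden-layer states

# Cluster the words by the similarity of the hidden states they evoke.
tree = linkage(hidden_vectors, method="average", metric="euclidean")
clusters = dendrogram(tree, labels=words, no_plot=True)      # cluster structure, no plot
print(clusters["ivl"])                                       # leaf labels in cluster order
```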

 

So what does this mean?