Linguistics 431/631: Connectionist language modeling

Ben Bergen

 

Meeting 11: Syntax

November 9, 2006

 

Syntax

 

Elman (1991) identifies three types of grammatical model:

·      Symbolic approaches (GB, Cognitive Grammar, HPSG, etc.)

·      Localist approaches: essentially localist connectionist renderings of symbolic models

·      Distributed models (e.g. Elman 1990, 1991)

·      We'll talk next week about the strengths and weaknesses of all three. We'll spend this meeting on the third type, because it is the most different.

 

Learning grammatical classes

 

Where do grammatical classes come from?

·      It could be that syntactic categories are wired into the heads of individual language users

·      However, there is no evidence for this, and the argument for it is circular and faulty:

o      We know there are innate grammatical classes because all languages have the same set of classes that behave the same way, and the reason they do is that grammatical classes are innate.

o      In fact, not all languages have the same grammatical classes, so even this circular reasoning rests on a false premise.

·      A simpler hypothesis (one that does not assume an evolutionary mutation) is that grammatical categories are learned directly (and implicitly) from language data

·      If all other things are equal, then by Occam's razor, this second, empiricist hypothesis is to be preferred

 

Elman (1990) decided to test what aspects of grammatical categories could be learned by a distributed recurrent connectionist network, on the sole basis of word order.

·      He randomly created short (two- or three-word) sentences

·      These sentences used 13 grammatical classes (Table 3)

·      Each was constructed from one of 15 sentence templates (Table 4)

·      10,000 sentences were randomly generated

·      Each word was represented by a unique 31-bit vector with one node on and all others off

·      Inputs were concatenated into a single stream - there were no breaks between sentences (Table 5)

·      The network had 31 input and 31 output nodes, plus 150 hidden nodes and 150 context nodes (connected one-to-one to the hidden nodes with fixed weights).

·      The task was to predict the next word on the basis of the current word (and the context layer's copy of the previous hidden state)

·      He had the network run through the entire data set 6 times

·      To evaluate the network, since several words may be possible in any given position, its outputs have to be compared with the probability that each word will occur in that position (a code sketch of this setup appears after this list).
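A minimal sketch of this setup in Python/NumPy is given below. The tiny lexicon and templates, the 20-unit hidden layer, the learning rate, and the softmax/cross-entropy training rule are all illustrative assumptions standing in for Elman's actual details (13 classes, 15 templates, a 31-bit code, 150 hidden units, sum-squared error); it is not his code.

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for Elman's lexicon, grammatical classes, and sentence
# templates (the real simulation used 13 classes, 15 templates, a 31-bit
# one-hot code, 150 hidden units, and 10,000 sentences).
lexicon = ["woman", "man", "cat", "dog", "cookie", "glass",
           "see", "chase", "eat", "break", "move"]
templates = [["woman", "see"], ["man", "chase", "dog"],
             ["cat", "eat", "cookie"], ["dog", "break", "glass"]]
V = len(lexicon)          # one input/output node per word
H = 20                    # hidden (and context) units
idx = {w: i for i, w in enumerate(lexicon)}

def one_hot(word):
    v = np.zeros(V)
    v[idx[word]] = 1.0
    return v

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Randomly generated sentences concatenated into one unbroken word stream.
stream = [w for _ in range(2000)
          for w in templates[rng.integers(len(templates))]]
X = np.array([one_hot(w) for w in stream])

# Simple recurrent ("Elman") network: the previous hidden state is copied
# back in as a context input alongside the current word.
W_xh = rng.normal(0.0, 0.1, (V, H))   # input   -> hidden
W_hh = rng.normal(0.0, 0.1, (H, H))   # context -> hidden
W_hy = rng.normal(0.0, 0.1, (H, V))   # hidden  -> output

lr = 0.1
for epoch in range(6):                # Elman ran six passes over the corpus
    h_prev = np.zeros(H)              # context starts empty
    for t in range(len(X) - 1):
        x, target = X[t], X[t + 1]    # task: predict the next word
        h = sigmoid(x @ W_xh + h_prev @ W_hh)
        y = softmax(h @ W_hy)
        # One-step weight updates (softmax + cross-entropy here for brevity;
        # errors are not propagated back into the copied context).
        d_y = y - target
        d_h = (W_hy @ d_y) * h * (1.0 - h)
        W_hy -= lr * np.outer(h, d_y)
        W_xh -= lr * np.outer(x, d_h)
        W_hh -= lr * np.outer(h_prev, d_h)
        h_prev = h

The architectural point to notice is the context layer: the previous hidden state is copied back in as extra input, and this copy is the network's only memory for the words that came before.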


 

Results

·      The network has an error of 0.05 when its outputs are compared to the probability vector for a given position (pretty good), and it has learned this without any prior grammatical knowledge.

·      To investigate the internal representations of the words, he averaged all of the hidden-layer activation vectors produced for each word (over the whole input run) into a single 150-element vector, and then subjected these averages to hierarchical clustering (a rough sketch of this analysis follows below).

·      This shows that the network has learned internal representations for word classes; these representations are both hierarchical and graded.
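Continuing from the sketch above, the code below approximates both evaluation steps. Two loud assumptions: the "correct" probability vectors are stood in for by simple bigram estimates from the training stream (Elman's were conditioned on the full preceding context), and the clustering uses SciPy's linkage/dendrogram, which is an arbitrary choice of tool.

import numpy as np
from collections import defaultdict
from scipy.cluster.hierarchy import dendrogram, linkage

# 1. Empirical next-word probabilities (a rough stand-in for Elman's
#    context-dependent probability vectors): bigram estimates from the corpus.
bigram = defaultdict(lambda: np.zeros(V))
for t in range(len(stream) - 1):
    bigram[stream[t]][idx[stream[t + 1]]] += 1
prob = {w: counts / counts.sum() for w, counts in bigram.items()}

# 2. Run the trained network over the corpus once more, comparing its outputs
#    to the probability vectors and averaging the hidden vector for each word.
sums = defaultdict(lambda: np.zeros(H))
n_tokens = defaultdict(int)
sq_err, steps = 0.0, 0
h_prev = np.zeros(H)
for t in range(len(stream) - 1):
    h_prev = sigmoid(X[t] @ W_xh + h_prev @ W_hh)
    y = softmax(h_prev @ W_hy)
    sq_err += np.mean((y - prob[stream[t]]) ** 2)
    steps += 1
    sums[stream[t]] += h_prev
    n_tokens[stream[t]] += 1
print("mean squared error vs. empirical probabilities:", sq_err / steps)

# 3. Hierarchical clustering of the averaged hidden representations: words
#    with similar distributional behaviour should end up in nearby clusters.
words = sorted(sums)
mean_h = np.stack([sums[w] / n_tokens[w] for w in words])
Z = linkage(mean_h, method="average")
dendrogram(Z, labels=words)           # drawing the tree requires matplotlib

Averaging over all tokens of a word type is what makes the clustering informative: each individual token's hidden vector also reflects its particular context, and averaging washes that context out, leaving the word's typical distributional behaviour.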

 

Conclusions

·      Of course, this model is simplified: it is much like learning language from the radio, in that context and meaning are only indirectly represented.

·      But it does show that word order can be a good cue to grammatical class, and that grammatical classes can be extracted from this superficial property.


The rest of the semester

13    11.14   More Syntax
      11.16   Computer lab: Syntax

14    11.21   Computer lab: Syntax II              R14
      11.23   No class - Thanksgiving

15    11.28   The brain                            R15
      11.30   What is connectionism?               R16

16    12.5    Student presentations
      12.7    Student presentations and wrap-up
      12.12   Term project due