Linguistics 431/631: Connectionist language modeling

Ben Bergen


Meeting 11: Syntax

November 9, 2006




Elman (1991) identifies three types of grammatical model

      Symbolic approaches (GB, Cognitive Grammar, HPSG, etc.)

      Localist approaches: essentially localist connectionist renderings of symbolic models

      Distributed models (e.g. Elman 1990, 1991)

      We'll talk next week about the strengths and weaknesses of all three. We'll spend this meeting discussing the third type, because it is the most different.


Learning grammatical classes


Where do grammatical classes come from?

      It could be that syntactic categories are wired into the heads of individual language users

      However, there is no evidence for this, and the argument for it is circular and faulty:

o      We know there are innate grammatical classes because all languages have the same set that behave the same way, and the reason they do is because grammatical classes are innate.

o      In fact, not all languages have the same grammatical classes, which means that even this circular reasoning is faulty.

      A simpler hypothesis (one that does not assume an evolutionary mutation) is that grammatical categories are learned directly (and implicitly) from language data

      If all other things are equal, then by Occam's razor, this second, empiricist hypothesis should be preferred


Elman (1990) decided to test what aspects of grammatical categories could be learned by a distributed recurrent connectionist network, on the sole basis of word order.

      He randomly created short (two- or three-word) sentences

      These sentences used 13 grammatical classes (Table 3)

      Each was constructed from one of 15 sentence templates (Table 4)

      10,000 sentences were randomly generated

      Each word was represented by a unique 31-bit vector with one node on and all others off (a "one-hot" encoding)

      Inputs were concatenated into a single string - there were no breaks between sentences (Table 5)

      The network had 31 input nodes, 31 output nodes, and 150 hidden nodes plus 150 context nodes (connected to the hidden layer by fixed one-to-one copy links).

      The task was to predict the next word on the basis of the current word (and the context layer's memory of the preceding words)

      He had the network run through the entire data set 6 times

      To evaluate this network: since multiple words may legitimately occur in each position, the outputs have to be compared with the probabilities that each word will occur in that position, rather than with the single word that actually followed.
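The architecture described above can be sketched in code. This is a minimal toy version, not Elman's actual implementation: the vocabulary, grammar, layer sizes, and learning rate are invented for illustration, and training is plain one-step backpropagation rather than anything tuned.

```python
import numpy as np

# Toy simple recurrent network (SRN) in the spirit of Elman (1990):
# predict the next one-hot word from the current word plus a context
# layer that copies the previous hidden state. All sizes, words, and
# hyperparameters here are illustrative assumptions, not Elman's.

rng = np.random.default_rng(0)

VOCAB = ["woman", "man", "eat", "see", "cookie", "book"]
V, H = len(VOCAB), 8          # Elman used 31 in/out units and 150 hidden

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# weights: input->hidden, context->hidden, hidden->output
W_xh = rng.normal(0, 0.1, (H, V))
W_ch = rng.normal(0, 0.1, (H, H))
W_hy = rng.normal(0, 0.1, (V, H))

# toy SUBJECT VERB OBJECT stream, concatenated with no sentence breaks
subjects, verbs, objects = [0, 1], [2, 3], [4, 5]
stream = []
for _ in range(2000):
    stream += [rng.choice(subjects), rng.choice(verbs), rng.choice(objects)]

lr = 0.1
for _ in range(6):                         # six passes, as in the text
    context = np.zeros(H)
    for t in range(len(stream) - 1):
        x = one_hot(stream[t])
        h = sigmoid(W_xh @ x + W_ch @ context)
        y = softmax(W_hy @ h)
        dy = y - one_hot(stream[t + 1])    # softmax + cross-entropy grad
        dh = (W_hy.T @ dy) * h * (1 - h)   # one-step backprop only
        W_hy -= lr * np.outer(dy, h)
        W_xh -= lr * np.outer(dh, x)
        W_ch -= lr * np.outer(dh, context)
        context = h                        # copy hidden state to context

# after "woman eat", the network should expect an object noun
context = np.zeros(H)
for w in [0, 2]:
    context = sigmoid(W_xh @ one_hot(w) + W_ch @ context)
probs = softmax(W_hy @ context)
print({VOCAB[i]: round(float(probs[i]), 2) for i in range(V)})
```

Note that the probability-vector evaluation falls out naturally here: with cross-entropy training, the softmax output converges toward the conditional distribution over next words, not toward any single word.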





      The network has an error of 0.05 when compared to the probability vector for a given position (pretty good), and it has learned this without any prior grammatical knowledge.
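This evaluation scheme can be made concrete with a sketch: score an output vector against the empirical next-word distribution, not against the single word that actually occurred. The tiny corpus and the "network output" vector below are invented for illustration.

```python
import numpy as np

# Sketch of probability-vector evaluation: compare a network's output
# to the empirical distribution of next words in context. The corpus
# and the hypothetical output vector are toy assumptions.

corpus = ["woman", "eat", "cookie", "man", "eat", "book",
          "woman", "see", "book", "man", "see", "cookie"]
vocab = sorted(set(corpus))   # ['book','cookie','eat','man','see','woman']

# count empirical next-word frequencies for each current word
counts = {w: {v: 0 for v in vocab} for w in vocab}
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def next_word_probs(word):
    c = counts[word]
    total = sum(c.values())
    return np.array([c[v] / total for v in vocab])

# hypothetical network output after seeing "eat" (over sorted vocab)
output = np.array([0.45, 0.50, 0.0, 0.0, 0.0, 0.05])
target = next_word_probs("eat")        # [0.5, 0.5, 0, 0, 0, 0]
error = float(np.mean((output - target) ** 2))
print(round(error, 4))                 # prints 0.0008
```

An output that merely predicted the word that actually came next would score much worse on this measure whenever several continuations are possible.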

      To investigate the internal representations of the words, he averaged all the hidden-layer activation vectors for each word (over the whole input run) into a single 150-dimensional vector, and then subjected these to hierarchical clustering.

      This shows that the network has learned internal representations for word classes; this representation is both hierarchical and graded.
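The averaging step behind that analysis can be sketched as follows. The hidden states here are synthetic, drawn around invented per-class prototypes, since we don't have the trained network's actual activations; in practice the averaged vectors would be fed to a hierarchical clustering routine (e.g. scipy's linkage/dendrogram).

```python
import numpy as np

# Sketch of Elman's analysis step: average each word's hidden-layer
# activation vectors over the whole run, then compare the averages.
# The hidden states are synthetic stand-ins drawn around invented
# class prototypes -- an illustrative assumption.

rng = np.random.default_rng(1)
H = 150                                   # Elman's hidden-layer size

classes = {"NOUN": ["woman", "man", "cookie"],
           "VERB": ["eat", "see", "chase"]}
prototypes = {c: rng.normal(0.0, 1.0, H) for c in classes}

avg = {}
for c, words in classes.items():
    for w in words:
        # 200 simulated hidden states per word, then their mean
        states = prototypes[c] + rng.normal(0.0, 0.3, (200, H))
        avg[w] = states.mean(axis=0)      # one 150-dim vector per word

def dist(a, b):
    return float(np.linalg.norm(avg[a] - avg[b]))

# same-class words end up far closer than cross-class words,
# which is the structure hierarchical clustering picks up on
print(dist("woman", "man") < dist("woman", "eat"))   # prints True
```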



      Of course, this model is simplified, because it is much like learning language from the radio: context and meaning are only indirectly represented.

      But it goes to show that word order can be a good cue for grammatical class, and that grammatical classes can be extracted from this superficial property.

The rest of the semester                            




More Syntax




Computer lab: Syntax




Computer lab: Syntax II




No class - Thanksgiving




The brain




What is connectionism?




Student presentations




Student presentations and wrap-up




Term project due