Linguistics 431/631: Connectionist language modeling

Ben Bergen


Meeting 4: Learning in the brain and connectionist networks

September 12, 2006


Learning in the brain


How does learning happen in the brain?

      There is very little growth of new synapses or neurons in post-natal animals (there is increasing evidence that some growth does occur, but too slowly to support online learning).

      Instead, learning predominantly involves the reorganization of existing structures.

      It involves the strengthening and weakening of existing synapses


Two types of learning

      Classical conditioning: when a stimulus that gives rise to a response is associated with another stimulus, that second stimulus also gives rise to the response.

      Operant conditioning: a reinforcing or punishing event strengthens or weakens the connection between a stimulus and response.


Classical conditioning


Donald Hebb theorized (1949):

"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."


This could explain how classical conditioning works.

      In classical conditioning, an Unconditioned Stimulus (US) already gives rise to an Unconditioned Response (R).

      A Conditioned Stimulus (CS), which does not normally give rise to the same response, is presented alongside the US.

      This yields an association between the CS and the R.



The strengthening of connections between associated neurons is called Hebbian learning, and the mechanism for it seems to be a process called Long-Term Potentiation (LTP).
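
Hebb's idea can be sketched in a few lines of code. This is only an illustration, not a model of real neurons: the threshold unit, the learning rate ETA, and the starting weights are all assumptions made for the example, and the US connection is held fixed for simplicity.

```python
# Minimal Hebbian-learning sketch of classical conditioning.
# All names and constants here are illustrative assumptions.

ETA = 0.1          # learning rate (assumed value)
THRESHOLD = 0.5    # the response unit fires if its net input reaches this (assumed)

def respond(us, cs, w_us, w_cs):
    """Return 1 if the response unit fires, else 0."""
    return 1 if us * w_us + cs * w_cs >= THRESHOLD else 0

def hebbian_update(weight, pre, post, eta=ETA):
    """Hebb's rule: strengthen the connection when pre and post fire together."""
    return weight + eta * pre * post

w_us = 1.0   # the US already drives the response
w_cs = 0.0   # the CS initially has no effect

# Before conditioning: present the CS alone.
before = respond(us=0, cs=1, w_us=w_us, w_cs=w_cs)

# Conditioning: pair the CS with the US over several trials.
for _ in range(10):
    r = respond(us=1, cs=1, w_us=w_us, w_cs=w_cs)
    w_cs = hebbian_update(w_cs, pre=1, post=r)

# After conditioning: present the CS alone again.
after = respond(us=0, cs=1, w_us=w_us, w_cs=w_cs)
print(before, after, w_cs)
```

Because the CS unit is active every time the response unit fires, its connection grows until the CS can trigger the response by itself.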

Operant conditioning


But Hebbian learning can't be the only way the brain changes its structure: we also learn on the basis of feedback.


In operant conditioning:

      A Response (R) precedes a Stimulus (S), which is either positive or negative.

      Depending on the polarity of the S, the R will become progressively more or less likely, even in the absence of the S.


Thorndike (1910):

"Of several responses made to the same situation those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections to the situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond."


We see in operant conditioning that learning happens progressively over trials
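
The progressive character of this kind of learning can be illustrated with a toy version of Thorndike's law. The function name, the update rate, and the outcome values below are assumptions chosen for the example; the only point is that a bond strengthens or weakens a little on each trial, and more so for larger outcomes.

```python
# Toy sketch of Thorndike's law of effect; all numbers are illustrative assumptions.

def law_of_effect(strength, outcome, rate=0.2):
    """Nudge a situation-response bond up (satisfaction, outcome > 0)
    or down (discomfort, outcome < 0), keeping it within [0, 1]."""
    return min(1.0, max(0.0, strength + rate * outcome))

rewarded = [0.5]   # bond strength of a response that is followed by satisfaction
punished = [0.5]   # bond strength of a response that is followed by discomfort
for _ in range(5):
    rewarded.append(law_of_effect(rewarded[-1], outcome=+0.5))
    punished.append(law_of_effect(punished[-1], outcome=-0.5))

print([round(s, 2) for s in rewarded])
print([round(s, 2) for s in punished])
```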




Unfortunately, we really don't know how operant conditioning happens in the brain.

      One leading hypothesis is that LTP accounts for it as well as for classical conditioning.

      On this account, some LTP is mediated by connections from the Ventral Tegmental Area (VTA), an evolutionarily ancient part of the brain thought to be responsible for the feeling of reward and pleasure.

      The story goes that positive reinforcement results from dopamine-induced LTP.

      But there's no story yet for negative conditioning.


Both types of learning (classical and operant) are implemented in connectionist systems, although the simulations we're doing use only a type of operant conditioning.



Learning through backpropagation of error


The main way that connectionist models learn is through a method akin to operant conditioning, called backpropagation of error.



      Start with a network architecture.

      Assign a (different) random weight to each of the connections (these may be constrained to fall within some weight limit).

      There is a set of input/output data, which pairs activations for the input nodes with the desired activations for the output nodes.



      The goal of learning is to learn weights such that any input will produce the desired output.

      It is also usually a goal for new input stimuli not included in the input/output data to give rise to the desired output (i.e. generalization)
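
The starting point described above might be set up as follows. This is a sketch under assumptions: the 2-2-1 architecture, the weight limit of 1.0, the fixed random seed, and the XOR pattern set are all chosen just for illustration.

```python
import random

random.seed(0)  # fixed seed only so that the sketch is repeatable

WEIGHT_LIMIT = 1.0  # assumed bound on the initial weights

def init_weights(n_from, n_to, limit=WEIGHT_LIMIT):
    """Return random weights; weights[i][j] is the connection from node j to node i."""
    return [[random.uniform(-limit, limit) for _ in range(n_from)]
            for _ in range(n_to)]

# Architecture (assumed): 2 input nodes -> 2 hidden nodes -> 1 output node, no bias.
w_hidden = init_weights(n_from=2, n_to=2)
w_output = init_weights(n_from=2, n_to=1)

# Input/output data: (input activations, desired output activations).
# XOR is used purely as a familiar illustration.
patterns = [
    ([0, 0], [0]),
    ([0, 1], [1]),
    ([1, 0], [1]),
    ([1, 1], [0]),
]

print(w_hidden)
print(w_output)
```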


The backpropagation algorithm


First, an input/output pattern is selected, at random or sequentially.

      The input is sent into the network and this produces activation values for the output nodes.

      Because the weights were selected randomly, these output activations will also be more or less random.
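
This first step (the forward pass) can be sketched as below, assuming sigmoid nodes and a single hidden layer. The weights shown are hypothetical values standing in for randomly assigned ones.

```python
import math

def sigmoid(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(weights, inputs):
    """Activation of each receiving node; weights[i][j] connects sender j to receiver i."""
    return [sigmoid(sum(w * a for w, a in zip(row, inputs))) for row in weights]

def forward(w_hidden, w_output, inputs):
    """Send an input pattern through the network; return hidden and output activations."""
    hidden = layer_output(w_hidden, inputs)
    output = layer_output(w_output, hidden)
    return hidden, output

# Hypothetical weights standing in for randomly assigned ones.
w_hidden = [[0.3, -0.2], [-0.4, 0.1]]
w_output = [[0.2, -0.5]]

hidden, output = forward(w_hidden, w_output, [1, 0])
print(hidden, output)
```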


Second, the network's actual output is compared to the desired output (specified in the pattern set).

      The error for each output node is computed as the difference between the desired and observed values for that node, multiplied by the derivative of the output node's activation function at the node's current activation level.

      With linear nodes, the derivative of the activation function is constant.

      The derivative of the sigmoid activation function changes with the activation of the node, which depends on the strength of the connections leading to the node; it is greatest when the node's activation is near 0.5 (as when connections are relatively weak) and smallest when the activation is near 0 or 1 (as when connections are strong).

      In essence then, the error is computed as greater when connections are weaker.
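
The error computation for an output node can be sketched as follows. The function names are made up for this example; the sigmoid's derivative is written in terms of the node's activation a, as a(1 - a), and a linear node is assumed to have a constant slope of 1.

```python
def sigmoid_slope(activation):
    """Derivative of the sigmoid, written in terms of the node's activation."""
    return activation * (1.0 - activation)

LINEAR_SLOPE = 1.0  # derivative of a linear (identity) node: always the same

def output_error(desired, observed, slope):
    """Error for an output node: (desired - observed) times the activation slope."""
    return (desired - observed) * slope

# The slope shrinks as the activation moves away from 0.5 toward 0 or 1.
for a in (0.5, 0.9, 0.99):
    print(a, sigmoid_slope(a), output_error(1.0, a, sigmoid_slope(a)))
```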


Third, given the error for each output node, connections into that node are modified so that the observed output will be more like the desired output.

      The formula used is: $\Delta w_{ij} = \eta \, \delta_{ip} \, o_{j}$

      In other words, the change in weight of the connection from node j to node i is equal to the product of the learning constant η (eta), which expresses how strong a modification you want to make (and is often between 0.1 and 0.5), times the error δ for unit i on the current pattern p (which we calculated above), times the output o of node j (because how much responsibility each node bears for the error depends in part on how active each node was).

      This operation is performed for each output node, but the weights are not yet changed.
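
As a sketch, the weight-change formula can be applied to every connection into a layer at once. The function name and the example numbers are assumptions; note that the function only returns the proposed changes and leaves the weights themselves untouched.

```python
def weight_changes(eta, errors, sender_outputs):
    """Change for each connection: eta * error of receiving node i * output of sending node j.

    Returns changes[i][j]; the weights themselves are not modified here.
    """
    return [[eta * d_i * o_j for o_j in sender_outputs] for d_i in errors]

# One output node with error 0.125, fed by two nodes with outputs 1.0 and 0.0.
changes = weight_changes(eta=0.5, errors=[0.125], sender_outputs=[1.0, 0.0])
print(changes)
```

The connection from the inactive sending node gets no change at all, since that node bears no responsibility for the error.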


Fourth, the same equation is used to calculate the weight changes for connections into the hidden nodes.

      Of course, we don't have desired outputs for these hidden nodes, so instead of calculating error as before, the hidden nodes inherit the error of the nodes they activate.

      If the hidden node activates nodes that have large errors, then that hidden node shares the blame.

      The error of a hidden node is thus the sum of the errors of the nodes it activates, each multiplied by the weight of the connection from the hidden node (since a node is more responsible for an error on a node it connects to if its connection to that node is stronger).

      This procedure is repeated for all layers down to the layer connected to the inputs. (This is why we call it backpropagation of error.)
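
The inheritance of error by hidden nodes can be sketched as below. One assumption goes beyond the summary above: following the standard formulation (and the way output errors were defined earlier), the inherited sum is also multiplied by the derivative of the hidden node's own sigmoid activation. The function name and the numbers are made up for the example.

```python
def hidden_errors(output_errors, w_output, hidden_activations):
    """Error for each hidden node h.

    w_output[k][h] is the weight from hidden node h to output node k.
    """
    errors = []
    for h, a_h in enumerate(hidden_activations):
        # Sum of the errors of the nodes h activates, weighted by connection strength.
        inherited = sum(d_k * w_output[k][h] for k, d_k in enumerate(output_errors))
        # Assumed extra factor: slope of the hidden node's sigmoid, a(1 - a).
        errors.append(inherited * a_h * (1.0 - a_h))
    return errors

# One output node with error 0.1; two hidden nodes connected to it with
# weights 0.5 and -1.0, both currently at activation 0.5.
errs = hidden_errors([0.1], [[0.5, -1.0]], [0.5, 0.5])
print(errs)
```

The hidden node with the stronger connection inherits the larger share of the blame, and the sign of its error follows the sign of the connection.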


Finally, the weights are changed.
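
Putting the steps together, one pass of the algorithm on a single pattern might look like this. It is a sketch under assumptions: sigmoid nodes throughout, one hidden layer, no bias, hypothetical starting weights, and a learning rate of 0.5. All errors are computed from the old weights, and only then are the weights changed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hidden, w_output, inputs):
    hidden = [sigmoid(sum(w * a for w, a in zip(row, inputs))) for row in w_hidden]
    output = [sigmoid(sum(w * a for w, a in zip(row, hidden))) for row in w_output]
    return hidden, output

def train_step(w_hidden, w_output, inputs, desired, eta=0.5):
    """One backpropagation step on a single pattern; returns the new weights."""
    hidden, output = forward(w_hidden, w_output, inputs)
    # Errors for the output nodes, then for the hidden nodes (using the OLD weights).
    d_out = [(t - o) * o * (1.0 - o) for t, o in zip(desired, output)]
    d_hid = [sum(d * w_output[k][h] for k, d in enumerate(d_out)) * a * (1.0 - a)
             for h, a in enumerate(hidden)]
    # Only now are the weights changed, all at once.
    new_w_output = [[w + eta * d * a for w, a in zip(row, hidden)]
                    for row, d in zip(w_output, d_out)]
    new_w_hidden = [[w + eta * d * a for w, a in zip(row, inputs)]
                    for row, d in zip(w_hidden, d_hid)]
    return new_w_hidden, new_w_output

# Hypothetical starting weights and a single training pattern.
w_hidden = [[0.3, -0.2], [-0.4, 0.1]]
w_output = [[0.2, -0.5]]
inputs, desired = [1, 0], [1]

_, out_before = forward(w_hidden, w_output, inputs)
w_hidden, w_output = train_step(w_hidden, w_output, inputs, desired)
_, out_after = forward(w_hidden, w_output, inputs)
print(out_before[0], out_after[0])
```

After the step, the output for this pattern should sit slightly closer to the desired value; repeating the step over many patterns and many passes is what gradually reduces the error.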




Take a network with two inputs, one output, no bias, and no hidden layer.

      The inputs are 1 and 1.

      The strengths of the connections from i1 and i2 are 0.5 and 0.75, respectively.

      The desired output is 1.

      The learning rate is 0.5.

      Calculate the error and the change that will be made to the connections.


You should know that the sigmoid activation function has the following values:























And the derivative of the sigmoid function has the following values:
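
Values of both functions can be computed directly, and the same code can be used to check the exercise above. This is a sketch that assumes the logistic sigmoid 1 / (1 + e^(-x)) and assumes the output node in the exercise is a sigmoid node; the sample inputs in the table are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Slope of the sigmoid at net input x, i.e. s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Net input, sigmoid value, derivative value.
for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"{x:5.1f}  {sigmoid(x):.3f}  {sigmoid_derivative(x):.3f}")

# The exercise: two inputs, one (assumed sigmoid) output, no bias, no hidden layer.
inputs = [1.0, 1.0]
weights = [0.5, 0.75]
desired = 1.0
eta = 0.5

net = sum(w * i for w, i in zip(weights, inputs))    # net input to the output node
out = sigmoid(net)                                   # observed output
error = (desired - out) * sigmoid_derivative(net)    # error for the output node
changes = [eta * error * i for i in inputs]          # change for each connection
print(net, out, error, changes)
```

Under these assumptions the net input is 1.25, the output comes to roughly 0.777, the error to roughly 0.039, and each connection changes by roughly 0.019 (the two changes are equal because both inputs are 1).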