Training a simple recurrent network

Relevant readings:

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.

Marcus, G. F. (1998). Rethinking eliminative connectionism. Cognitive Psychology, 37(3), 243-282.

You will need to save a copy of the day1.tar.gz file on your computer and then decompress it in a Unix terminal by typing:

tar -zxvf day1.tar.gz

You will get a directory with these files:

srn.in # LENS model file for simple recurrent network

feedfor.in # LENS model file for feed-forward network

elman93.in # LENS model file for SRN with compression layers like Elman 1993

train.ex # training input

test.ex # testing file

To start the LENS simulator, type the following in your xterm window:

lens -c srn.in

To look at the model, click on the "Unit Viewer" button. You should see the layers of a simple recurrent network (SRN): cword, context, hidden, and word. In addition to the normal layers of an SRN, there is also a layer called targ, which is simply a buffer that is set to the same word as the target for the next word. This will be important later, but here all that happens is that the targ activation is copied to the cword layer at the next time step, where it serves as the previous word.

During training, it is useful to close the "Unit Viewer" window; otherwise training can take a while.

To start a graph of the error, click on the "New Graph" button and then press "OK". An error graph window should open.

Train the model by clicking on the "Train Network" button. In your LENS console, you should see output like this:

Performing 20000 updates using Doug's Momentum...

__Update____Error___UnitCost__Wgt.Cost__Grad.Lin__TimeUsed__TimeLeft__

500) 9.73483 0.00000 267.698 -0.05014 0s 13s

1000) 11.5633 0.00000 370.845 -0.12867 0s 13s

1500) 9.18450 0.00000 468.432 -0.31921 1s 13s

2000) 9.95792 0.00000 551.523 0.24985 1s 12s

2500) 6.08390 0.00000 624.209 0.21504 1s 1

...

18500) 7.15489 0.00000 2229.37 0.04333 13s 1s

19000) 6.69672 0.00000 2275.40 0.22308 13s 0s

19500) 6.44954 0.00000 2318.36 0.48349 13s 0s

20000) 7.32600 0.00000 2347.85 0.16517 14s 0s

Performed 20000 updates

Total time elapsed: 14.133 seconds

The error is in the second column, and it should also appear in the error graph window. It will go up and down a bit, but it should slowly move downward.

After the model finishes learning, click on the "Unit Viewer" button again. Now you will be looking at the activations of the trained model. You will see the sentences in the training set on the left side. Click on the first sentence in the training set. You should be looking at the activation of the network at the beginning of the sentence. Since all the sentences in the grammar start with either "the" or "a", only two word units should be activated. To make this easier to see, go to the Palette and click on Hinton Diagram. Only two word units will be dark gray, and the rest should be black. One will have a white square around it. That one is the target unit, the one that the network is trying to predict. Because the SRN model has no semantics, it can only predict articles at the beginning of a sentence.

Now click on the single right arrow button at the top right corner of the Unit Viewer window (the third button from the right on the top row). You should see that the previous target unit is now active in the cword layer, because the network always uses the previous word to predict the next word. In the word layer at the top, the article units are no longer activated; instead, all of the noun units are activated (in the Hinton diagram, you should see faint white lines).

Click on the single right arrow again, and the cword layer will again hold a copy of the previous word. Now, in the word layer, the verbs should all have a faint white line, meaning that the network expects a verb to come next.

Try going through the rest of the sentence and predicting the next set of words at each point.

Now, click on the "Link Viewer" button to see the weights between the layers.

This particular model seems to use hidden units 1, 4, and 6 to activate nouns (word units 1-16). Verbs (word units 17-28) are activated by hidden units 5 and 2, and articles by hidden units 9 and 2. Different trained networks will have different units representing these distinctions.

To get a feel for how the model works, change the parameters. To see how the error changes across different runs, keep the error graph open. Then click on "Reset Network"; the error graph will reset, but it will still show your previous results. Then try changing the parameters in the main panel. For example, try changing the learning rate to 0.1 or 0.2, the momentum to 0.7, or the initial randomization range to 2. You can also try changing the algorithm from "Doug's Momentum" to "Steepest Descent". Always reset the network after each change, so that you are starting from the same starting point.
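The same kinds of changes can presumably also be made from the LENS command prompt instead of the main panel. The following is only a sketch, assuming the standard setObj, resetNet, and train commands and the usual names of the network parameters (learningRate, momentum, randRange):

setObj learningRate 0.1   ;# or 0.2
setObj momentum 0.7
setObj randRange 2        ;# range used when the weights are randomized
resetNet                  ;# reinitialize the weights before retraining
train                     ;# retrain with the new settings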

Generalization outside of the Training Space

In the training set, the word dog was not allowed to occur in the dative goal position. This means that dogs could only be agents, since transitive patients could only be non-living things. So even though dog is outside the model's training space for recipients, we can test whether the model can still predict sentences where dog is the recipient. To test this, a test set was created in which every sentence has dog as the recipient. Go through these test items (change the example set to the test set in the Unit Viewer) and see if there is any difference in the activation values for dog at the positions where dog should be expected in these sentences. Also take a look at the Link Viewer and see if there is any difference between the hidden -> word mappings for dog and those for the other nouns.
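If the test set does not already appear in the Unit Viewer's example-set menu, it can presumably be loaded and selected from the LENS console along these lines (a sketch only; the set name testset is made up here, and srn.in may already load test.ex under a different name):

loadExamples test.ex -s testset   ;# load the test examples as a set called testset
useTestingSet testset             ;# make it the current testing set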

Comparing the SRN with a feed-forward model and an SRN with compression layers

To understand how the architecture of the SRN works, it is useful to compare it with a feed-forward model that doesn't have the context layer. Start up the feed-forward model by typing: lens -c feedfor.in

Train the model and look at the unit viewer and link viewer. How is dog represented in the links from the hidden layer?

Elman (1993) used an architecture which had "compression" layers between the word layer and the hidden units. These layers recode words into categories that are then used by the hidden layer. To examine this model, type: lens -c elman93.in. Again, train the model and look at the unit viewer and the link viewer. How is dog represented in the model?

Examining the LENS model code

Inside the srn.in file is the code that sets up the SRN. An important feature is that the context layer comes before the hidden layer, which means that it is updated (receives a copy of the previous hidden layer activations) before the hidden layer is updated with new activations. Another unusual feature is the targ layer, which stores a copy of the next word. This is copied back to the cword layer to give that layer the previous word.

# variables for model layer sizes

set lexSize 45

set hiddenSize 15

set contextSize $hiddenSize

set compressSize 5

set ccompressSize 5

## create model

addNet srn -i 30

addGroup cword $lexSize ELMAN

addGroup context $contextSize ELMAN

addGroup hidden $hiddenSize

addGroup targ $lexSize INPUT

addGroup word $lexSize OUTPUT

## connect layers

connectGroups context hidden

connectGroups hidden word

connectGroups cword hidden

## create elman unit connections and initial states for context

elmanConnect hidden context -r 1 -init 0.5

## this creates a connection that copies the targ activation to the cword

elmanConnect targ cword -r 1 -init 0.0

The targ layer is necessary because of our environment files. These files have the format that we will be using later for the Dual-path model, so they contain extra information that is not important here. The first line of each pattern is the pattern's name, which is the word sequence. Next there is a number giving the number of words in the utterance. The length here is 7, because in addition to the utterance, we also have two periods (these periods will be important in the later production models). Finally, we have the sequence of input (i:) and target (t:) pairs. The input is copied into the targ layer, which then copies it to the cword layer on the next time step. The target is used to generate error at the word layer, which is then backpropagated through the network. (A sketch of how such example files are typically loaded appears after the example below.)

name:{ the nurse is run -ing . . }

7

i:{targ 1.0} 29

t:{word 1.0} 29

i:{targ 1.0} 11

t:{word 1.0} 11

i:{targ 1.0} 31

t:{word 1.0} 31

i:{targ 1.0} 20

t:{word 1.0} 20

i:{targ 1.0} 40

t:{word 1.0} 40

i:{targ 1.0} 38

t:{word 1.0} 38

i:{targ 1.0} 38

t:{word 1.0} 38;
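For reference, example files in this format are attached to the network with the loadExamples command. This is a rough sketch of what srn.in presumably does (the set names are made up here, and the flags in the actual file may differ):

loadExamples train.ex -s trainset   ;# training examples
loadExamples test.ex -s testset     ;# test examples
useTrainingSet trainset
useTestingSet testset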

To get a feed-forward model, we just removed these lines from the srn.in file:

addGroup context $contextSize ELMAN

connectGroups context hidden

## create elman unit connections and initial states for context

elmanConnect hidden context -r 1 -init 0.5

To get an elman93-type model, we added these lines to the srn.in file:

addGroup ccompress $ccompressSize

addGroup compress $compressSize

and changed the connectGroups commands (listing three groups connects them in a chain, e.g., cword -> ccompress -> hidden):

connectGroups hidden compress word

connectGroups cword ccompress hidden