Changing the training and testing sets

You will need to save the day4.tar.gz file and decompress it as before (tar -zxvf day4.tar.gz).

Here are the files:

dualpath4.in # model file for dual-path model

traine.ex # English files

testprode.ex

trainj.ex # Japanese files

testprodj.ex

decode9.perl # translates model's activations into word sequences

syncode.perl # codes word sequences with syntactic categories

arcthis.perl # saves everything in a separate directory, cleans up main directory

env/ # directory with programs for creating input environment (training and testing files)

generate.perl* # creates message-sentence pairs

translate.perl* # translates output of generate into LENS input file

envgramenglish # English grammar

envgramjapanese # Japanese grammar

envgramglorp # grammar used to produce glorp sentences

envgramnodoggoal # grammar used to generate training set without dog-goals

envgramonlydoggoal # grammar used to test only dog-goal sentences

In this part of the tutorial we will change the message-sentence pairs and hence modify the training and testing input. To do so, we need some information about the programs that generate the input sets. First, there are programs that create the message (the semantic content). Other programs, called "grammars", transform the semantic message into sentences of a specific language. The examples below assume that you have a basic understanding of these programs.

A Japanese Grammar

Our goal is to train the model on different languages and to see whether and how the properties of those languages influence the model's behavior. Here we show how to change an English grammar into a Japanese grammar: we start with the file envgramenglish and create the file envgramjapanese.

Let's first look at the constructions section:

mess: A=INTRANSVERB Y=LIVING,DET E=TENSE,ASPECT,YY

sent: Y1 Y0 A0 E0 E1 .

# the cat sleep -s

mess: A=TRANSVERB X=LIVING,DET Y=NONLIVING,DET E=TENSE,ASPECT,XX,YY

sent: X1 X0 A0 E0 E1 Y1 Y0 .

# the dog throw -s the stick

mess: A=TRANSVERB X=LIVING,DET Y=NONLIVING,DET E=TENSE,ASPECT,YY,-1,XX

sent: Y1 Y0 is E0 E1 A0 by X1 X0 .

# the stick is throw -par by the dog

mess: A=DATIVEVERB X=LIVING,DET Y=NONLIVING,DET Z=LIVING,DET E=TENSE,ASPECT,XX,YY,-1,ZZ

sent: X1 X0 A0 E0 E1 Y1 Y0 to Z1 Z0 .

# the girl throw -s the stick to the dog

mess: A=DATIVEVERB X=LIVING,DET Y=NONLIVING,DET Z=LIVING,DET E=TENSE,ASPECT,XX,ZZ,YY

sent: X1 X0 A0 E0 E1 Z1 Z0 Y1 Y0 .

# the girl throw -s the dog the stick

Each construction has three lines. The mess: line specifies constraints on the message, the sent: line defines how message elements are mapped onto the sentence, and the final # line shows an example sentence generated by this pair.

Take the first construction (example: "the cat sleep -s"). According to its mess: line, a word that is a member of the category INTRANSVERB (e.g., "sleep") is selected and linked to the action role (A). Then members of the categories LIVING and DET (e.g., "cat" and "the", respectively) are randomly selected and linked to the Y role (the intransitive agent). Finally, members of the categories TENSE, ASPECT, and YY are selected, so the event-semantics (E) is set to a list such as PRES, SIMP, YY.
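The selection step can be pictured with a small Python sketch. It is not the code in generate.perl. The lexicon is hypothetical (only words that appear in this tutorial's examples, plus one made-up verb), and it assumes that a name without a lexicon entry, such as YY, is copied into the message unchanged.

```python
import random

# Hypothetical lexicon: the real category members live in the grammar files.
LEXICON = {
    "INTRANSVERB": ["sleep", "jump"],  # "jump" is a made-up member
    "LIVING": ["cat", "dog", "girl"],
    "DET": ["the"],
    "TENSE": ["PRES", "PAST"],
    "ASPECT": ["SIMP", "PROG"],
}

def build_message(mess_line, rng=random):
    """Turn 'A=INTRANSVERB Y=LIVING,DET ...' into {role: [elements]}."""
    message = {}
    for pair in mess_line.split():
        role, categories = pair.split("=")
        message[role] = [
            # Assumption: names with no lexicon entry (e.g. YY) pass through as-is.
            rng.choice(LEXICON[c]) if c in LEXICON else c
            for c in categories.split(",")
        ]
    return message

message = build_message("A=INTRANSVERB Y=LIVING,DET E=TENSE,ASPECT,YY")
print(message)  # one random draw per category, e.g. Y might be ['cat', 'the']
```

Each role ends up holding a list whose order matches the order of categories on the mess: line, which is what the numeric indices on the sent: line refer to.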

Now these elements have to be sequenced in an English sentence. The sent: line specifies how this is done. Each role (A, Y, E) is an array of elements indexed from 0. So if the sent: line put Y0 before Y1, the word "cat" would come before the word "the". Since determiners precede nouns in English, the sent: line has Y1 before Y0. A0 is sequenced next, since the verb follows the subject in English. Finally, the information about tense and aspect (E0 and E1) is put after the verb to create the sequence "sleep PRES SIMP". The language-specific rewrite rules ("language grammar") will change this into "sleep -s".
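As a rough illustration of this indexing, here is a minimal Python sketch of how a sent: line could be expanded. The function name linearize and the dictionary layout are our own assumptions for illustration, not the internals of generate.perl.

```python
# Hypothetical message for the intransitive construction: each role maps to a
# 0-indexed list, mirroring "A=INTRANSVERB Y=LIVING,DET E=TENSE,ASPECT,YY".
message = {
    "A": ["sleep"],
    "Y": ["cat", "the"],
    "E": ["PRES", "SIMP", "YY"],
}

def linearize(sent_template, message):
    """Expand a sent: template such as 'Y1 Y0 A0 E0 E1 .' into a word string."""
    words = []
    for token in sent_template.split():
        role, index = token[:-1], token[-1]
        if role in message and index.isdigit():
            words.append(message[role][int(index)])  # role slot, e.g. Y1 -> "the"
        else:
            words.append(token)  # literal word or punctuation, e.g. "is", "."
    return " ".join(words)

english = linearize("Y1 Y0 A0 E0 E1 .", message)
print(english)  # the cat sleep PRES SIMP .
```

Swapping Y1 and Y0 in the template would yield "cat the sleep ...", which is why the English grammar lists the determiner slot first.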

Now what about Japanese? First of all, we don't change the mess: line, because our conservative assumption is that speakers of different languages have the same conceptualization of the world. Second, Japanese is a relatively free word order language that uses particles to mark syntactic/thematic relationships (ga = subject/agent, ni = indirect object/recipient, wo = direct object/patient). It is verb-final and does not have determiners.

So the English transitive would be changed as follows:

English sent: X1 X0 A0 E0 E1 Y1 Y0 . # the dog throw -s the stick

Japanese sent: X0 ga Y0 wo A0 E0 E1 # dog ga stick wo throw

Flexible word order allows Japanese speakers to scramble arguments where an English speaker would need a passive:

English sent: Y1 Y0 is E0 E1 A0 by X1 X0 . # the stick is throw -par by the dog

Japanese sent: Y0 wo X0 ga A0 E0 E1 # stick wo dog ga throw

The message still contains information about determiners (X1, Y1). However, the Japanese sent: lines never reference it, so it does not appear in the actual Japanese utterance.
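To make the contrast concrete, the following Python sketch expands the two Japanese sent: lines for the same event. The helper linearize and the message dictionaries are hypothetical stand-ins for what the grammar programs do. The two messages differ only in their event-semantics, as in the mess: lines of the English active and passive.

```python
def linearize(template, message):
    """Replace role slots like X0 with message elements; keep other tokens."""
    out = []
    for token in template.split():
        role, index = token[:-1], token[-1]
        is_slot = role in message and index.isdigit()
        out.append(message[role][int(index)] if is_slot else token)
    return " ".join(out)

# Hypothetical messages for "dog throw stick".
active = {
    "A": ["throw"],
    "X": ["dog", "the"],
    "Y": ["stick", "the"],
    "E": ["PRES", "SIMP", "XX", "YY"],
}
passive = dict(active, E=["PRES", "SIMP", "YY", "-1", "XX"])

canonical = linearize("X0 ga Y0 wo A0 E0 E1", active)
scrambled = linearize("Y0 wo X0 ga A0 E0 E1", passive)
print(canonical)  # dog ga stick wo throw PRES SIMP
print(scrambled)  # stick wo dog ga throw PRES SIMP
```

The determiner stored at index 1 of X and Y is present in both messages but never surfaces, because neither template mentions X1 or Y1.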

Finally, language-specific rewrite rules are applied. Each rewrite rule has two parts: s/SEARCH/REPLACE/. For example, the rule s/PAST SIMP/-ed/ replaces the string "PAST SIMP" with the English past tense morpheme "-ed". Because the English system has passives, we need to make the auxiliary agree with the subject. In the Japanese system, the verb remains unchanged during scrambling, and so we just need to convert the tense-aspect semantics into Japanese words. Compare the two systems.

English:

s/is PRES PROG (\S+)/is being $1 -par/;

s/is PAST PROG (\S+)/was being $1 -par/;

s/is PRES SIMP (\S+)/is $1 -par/;

s/is PAST SIMP (\S+)/was $1 -par/;

s/(\S+) PRES PROG/is $1 -ing/;

s/(\S+) PAST PROG/was $1 -ing/;

s/PRES SIMP/-s/;

s/PAST SIMP/-ed/;

Japanese:

s/ PRES PROG/ te iru/;

s/ PAST PROG/ te ita/;

s/ PRES SIMP//;

s/ PAST SIMP/ ta/;
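To see what the two rule sets do to a raw sequence, here is a Python transcription using re.sub. It is only a sketch for experimenting outside the tutorial's Perl scripts. The helper rewrite is our own name, and it assumes the rules are applied once each, in the order listed.

```python
import re

# Python transcriptions of the Perl rules above. Order matters: the passive
# rules (starting with "is") must fire before the more general ones.
ENGLISH_RULES = [
    (r"is PRES PROG (\S+)", r"is being \1 -par"),
    (r"is PAST PROG (\S+)", r"was being \1 -par"),
    (r"is PRES SIMP (\S+)", r"is \1 -par"),
    (r"is PAST SIMP (\S+)", r"was \1 -par"),
    (r"(\S+) PRES PROG", r"is \1 -ing"),
    (r"(\S+) PAST PROG", r"was \1 -ing"),
    (r"PRES SIMP", "-s"),
    (r"PAST SIMP", "-ed"),
]
JAPANESE_RULES = [
    (r" PRES PROG", " te iru"),
    (r" PAST PROG", " te ita"),
    (r" PRES SIMP", ""),
    (r" PAST SIMP", " ta"),
]

def rewrite(sentence, rules):
    """Apply each (pattern, replacement) pair once, in order."""
    for pattern, replacement in rules:
        # count=1 mirrors Perl's s/// without the /g flag.
        sentence = re.sub(pattern, replacement, sentence, count=1)
    return sentence

print(rewrite("the dog throw PRES SIMP the stick .", ENGLISH_RULES))
# the dog throw -s the stick .
print(rewrite("the stick is PRES SIMP throw by the dog .", ENGLISH_RULES))
# the stick is throw -par by the dog .
print(rewrite("dog ga stick wo throw PAST SIMP", JAPANESE_RULES))
# dog ga stick wo throw ta
```

The English set needs twice as many rules because the auxiliary "is" has to be rewritten along with the verb morphology, whereas the Japanese set only touches the tense-aspect markers after the verb.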

To generate 20 Japanese utterances, type:

generate.perl -n 20 envgramjapanese | less

Or take a look at trainj.ex and testprodj.ex for Japanese, and traine.ex and testprode.ex for English. These files were generated with the following commands in the env directory:

generate.perl -n 1000 envgramjapanese | translate.perl > ../trainj.ex

generate.perl -n 200 envgramjapanese | translate.perl | grep -v "i:" > ../testprodj.ex

generate.perl -n 1000 envgramenglish | translate.perl > ../traine.ex

generate.perl -n 200 envgramenglish | translate.perl | grep -v "i:" > ../testprode.ex

Try training the model with these files (remember to archive between models, e.g., arcthis.perl japanese).

To train Japanese models: lens -b dualpath4.in 'trainSave j;exit' &

To train English models: lens -b dualpath4.in 'trainSave e;exit' &

Take a look at the output files, and also look at the model inside the simulator.