
Feature Model

DeSR uses a classifier to decide which action to perform during parsing. The classifier is trained on features extracted from the context of each action. The context consists of the stack, which contains the tokens already processed, and the input queue, which holds the remaining input tokens.

The configuration file (desr.conf by default) specifies which features to extract from the context.

The notation for expressing features is the following.

Path Expressions

Tokens are identified by path expressions, which can be nested.

 Type Description
 -i negative numbers identify tokens on the stack: -1 is the top of the stack, -2 is the second from the top, etc.
 i non-negative numbers identify tokens on the input queue: 0 is the next token, 1 is the second, etc.
 leftChild the leftmost dependent of the token, if present
 rightChild the rightmost dependent of the token, if present
 parent the head of the token, if present
 leftSibling the left sibling of the token, if present
 rightSibling the right sibling of the token, if present
 prev the immediately preceding token in the input sentence, if present
 next the immediately following token in the input sentence, if present
 ancestor the ancestor of the token, if present
 leftDesc the leftmost descendant of the token, if present
 rightDesc the rightmost descendant of the token, if present
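
How nested path expressions might be resolved against a parser state can be sketched in Python. This is only an illustration, not DeSR's implementation: the Token class, the representation of the stack and queue as Python lists (top of the stack last, next input token first), and the function names are all assumptions, and only leftChild, rightChild and parent are covered.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Token:
    """Hypothetical token: sentence position, word form, head and dependents."""
    id: int
    form: str
    head: Optional["Token"] = None
    children: List["Token"] = field(default_factory=list)

    def attach(self, child: "Token") -> None:
        """Make `child` a dependent of this token."""
        child.head = self
        self.children.append(child)


# Navigation steps; each takes a token and returns a token or None ("if present").
NAVIGATORS = {
    "leftChild": lambda t: min(t.children, key=lambda c: c.id, default=None),
    "rightChild": lambda t: max(t.children, key=lambda c: c.id, default=None),
    "parent": lambda t: t.head,
}


def by_index(index: int, stack: List[Token], queue: List[Token]) -> Optional[Token]:
    """Negative indices address the stack (-1 is the top), others the input queue."""
    if index < 0:
        return stack[index] if -index <= len(stack) else None
    return queue[index] if index < len(queue) else None


def evaluate(path: str, stack: List[Token], queue: List[Token]) -> Optional[Token]:
    """Resolve a possibly nested path expression such as 'rightChild(rightChild(0))'."""
    path = path.strip()
    nested = re.fullmatch(r"(\w+)\((.*)\)", path)
    if nested is None:
        return by_index(int(path), stack, queue)
    name, inner = nested.groups()
    token = evaluate(inner, stack, queue)
    return NAVIGATORS[name](token) if token is not None else None


# Example state for "She saw the dog": "saw" is on the stack with dependent "She",
# "dog" is the next input token with dependent "the".
she, saw, the, dog = (Token(i, w) for i, w in enumerate(["She", "saw", "the", "dog"]))
saw.attach(she)
dog.attach(the)
stack, queue = [saw], [dog]
```

With this state, evaluate("leftChild(-1)", stack, queue) yields the token "She", while evaluate("rightChild(rightChild(0))", stack, queue) yields None because "the" has no dependents.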

Features

Elementary token features are obtained from attributes of tokens. For example:

FORM(rightChild(0))

is the feature obtained from the attribute FORM of the rightmost dependent of the next token on the input.

Composite features are the concatenation of elementary token features. For example:

LEMMA(-1) POSTAG(0) DEPREL(leftChild(-1))

is the concatenation of the LEMMA of the top of the stack, the POSTAG of the next token and the DEPREL of the leftmost dependent of the top of the stack.

In the configuration file, features are expressed one per line, starting with Feature. For example:

Feature LEMMA(-1) POSTAG(0) DEPREL(leftChild(-1))

As a shorthand, elementary features that use the same attribute can be grouped with Features, as follows:

Features        POSTAG -1 0 1 rightChild(0) rightChild(rightChild(0))

which represents five features made from the attribute POSTAG extracted from the tokens denoted by the five path expressions.
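
The relation between the two notations can be illustrated with a short Python sketch that turns a configuration line into feature templates. It is not DeSR's own parser: the function names are hypothetical, and it assumes that path expressions contain no whitespace. Each template is a list of (attribute, path expression) pairs, so a Feature line yields one composite template and a Features line yields one single-element template per path expression.

```python
from typing import List, Tuple

Elementary = Tuple[str, str]  # (attribute, path expression)


def parse_elementary(text: str) -> Elementary:
    """Split 'DEPREL(leftChild(-1))' into ('DEPREL', 'leftChild(-1)')."""
    attribute, paren, rest = text.partition("(")
    if not paren or not rest.endswith(")"):
        raise ValueError(f"not an elementary feature: {text!r}")
    return attribute, rest[:-1]


def parse_line(line: str) -> List[List[Elementary]]:
    """Return the feature templates described by one configuration line."""
    keyword, *args = line.split()
    if keyword == "Feature":
        # One composite feature: the concatenation of its elementary features.
        return [[parse_elementary(arg) for arg in args]]
    if keyword == "Features":
        # Shorthand: one attribute followed by several path expressions.
        attribute, *paths = args
        return [[(attribute, path)] for path in paths]
    raise ValueError(f"not a feature line: {line!r}")


composite = parse_line("Feature LEMMA(-1) POSTAG(0) DEPREL(leftChild(-1))")
grouped = parse_line("Features        POSTAG -1 0 1 rightChild(0) rightChild(rightChild(0))")
```

Here composite holds a single template with three elementary features, and grouped holds five templates, one for each path expression.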

Global Features

Global features are extracted from the global state of the parser.

 Name Type Description
 LexChildNonWord boolean Notice children containing non-ASCII letters
 UseChildPunct boolean Notice punctuation in children of focus words
 StackSize boolean Record whether the stack has more than one item
 InputSize boolean Record whether the input has more than one item
 InPunct boolean Record whether the number of punctuation marks is even or odd
 VerbCount boolean Enable count of preceding verbs
 PastActions int Latest actions
 WordDistance boolean Distance between top stack token and next input token
 PunctCount boolean Enable count of preceding punctuation marks
 MorphoSplit boolean Extract individual morphological traits
 MorphoAgreement boolean Check morphological agreement
 MorphExtract boolean Extract morph items (gender, number, case) from morphology
 PrepChildEntityType boolean Note entity type (time or location) for children of prepositions

Feature Cutoff

The set of generated features can be pruned with the following options.

 Name Type Description
 FeatureCutoff int Drop features occurring fewer than this many times
 LexCutoff int Collapse to Unknown the forms or lemmas occurring fewer than LexCutoff times
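
The effect of the two cutoffs can be sketched in Python as follows. This is a minimal illustration based only on the descriptions above, not DeSR's code: the function names, the representation of training instances as lists of feature strings, and the "#UNKNOWN" placeholder are assumptions.

```python
from collections import Counter
from typing import List


def apply_feature_cutoff(instances: List[List[str]], cutoff: int) -> List[List[str]]:
    """Drop every feature that occurs fewer than `cutoff` times over all instances."""
    counts = Counter(feature for instance in instances for feature in instance)
    return [[f for f in instance if counts[f] >= cutoff] for instance in instances]


def apply_lex_cutoff(words: List[str], cutoff: int, unknown: str = "#UNKNOWN") -> List[str]:
    """Replace forms or lemmas that occur fewer than `cutoff` times by a placeholder."""
    counts = Counter(words)
    return [w if counts[w] >= cutoff else unknown for w in words]


# Hypothetical feature strings for three training instances.
instances = [["P-1=DT", "P0=NN"], ["P-1=DT", "P0=VB"], ["P-1=JJ", "P0=NN"]]
pruned = apply_feature_cutoff(instances, cutoff=2)

lemmas = ["the", "dog", "the", "aardvark"]
collapsed = apply_lex_cutoff(lemmas, cutoff=2)
```

With a cutoff of 2, pruned keeps only "P-1=DT" and "P0=NN", and collapsed keeps "the" while replacing "dog" and "aardvark" with the placeholder.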

Feature Transformations

Attributes of tokens can be transformed before parsing by applying regular expression substitution.
This is currently available only for lemmas. For example, the following rule normalizes real numbers to 0.0:

LemmaReplace    [0-9]+[0-9,]*.[0-9]*   0.0
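
The behaviour of this pattern can be explored with Python's re module. The sketch below makes two assumptions that the guide does not confirm: that DeSR's regular expression dialect treats these constructs as Python does, and that the rule is applied only when the whole lemma matches. The normalize function and the escaped variant of the pattern are hypothetical. Under these assumptions the unescaped "." matches any character, so the pattern also accepts strings such as "12x"; writing the dot as "\." restricts the rule to numbers with a decimal point.

```python
import re

# Pattern exactly as written in the LemmaReplace rule above.
ORIGINAL = re.compile(r"[0-9]+[0-9,]*.[0-9]*")
# Hypothetical stricter variant with the decimal point escaped.
ESCAPED = re.compile(r"[0-9]+[0-9,]*\.[0-9]*")


def normalize(lemma: str, pattern: re.Pattern, replacement: str = "0.0") -> str:
    """Return `replacement` if the whole lemma matches `pattern`, else the lemma."""
    return replacement if pattern.fullmatch(lemma) else lemma


# Compare the two patterns on a few lemmas.
for lemma in ["3.14", "1,000.5", "12x", "dog"]:
    print(lemma, normalize(lemma, ORIGINAL), normalize(lemma, ESCAPED))
```

Both patterns map "3.14" and "1,000.5" to 0.0 and leave "dog" unchanged; only the original pattern rewrites "12x".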