DeSR uses a classifier to decide which action to perform at each step of parsing. The classifier is trained on features extracted from the context of each action. The context consists of the stack, which holds the tokens already processed, and the input queue, which holds the remaining input tokens.
In the configuration file (default desr.conf) it is possible to specify which features to extract from the context.
The notation for expressing features is the following.
Tokens are denoted by path expressions of the types listed below, which can be nested.
| Type|| Description|
| -i|| negative numbers identify tokens on the stack: -1 is the top of the stack, -2 is the second from the top, etc.|
| i|| non-negative numbers identify tokens on the input queue: 0 is the next, 1 is the second, etc.|
| leftChild|| the leftmost dependent of the token, if present|
| rightChild|| the rightmost dependent of the token, if present|
| parent|| the head of the token, if present|
| leftSibling|| the left sibling of the token, if present|
| rightSibling|| the right sibling of the token, if present|
| prev|| the immediately preceding token in the input sentence, if present|
| next|| the immediately following token in the input sentence, if present|
| ancestor|| the ancestor of the token, if present|
| leftDesc|| the leftmost descendant of the token, if present|
| rightDesc|| the rightmost descendant of the token, if present|
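To make the nesting concrete, here is a minimal Python sketch of how a path expression could be resolved against a parser state. It is not DeSR's implementation: the `Token` class, the `attach` helper, and the `resolve` function are hypothetical names, only `parent`, `leftChild`, and `rightChild` are covered, and the sketch assumes that the leftmost and rightmost dependents are simply the first and last children in sentence order.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Token:
    """Hypothetical token record; not DeSR's own data structure."""
    form: str
    parent: Optional["Token"] = None
    children: List["Token"] = field(default_factory=list)  # in sentence order


def attach(head: Token, dependent: Token) -> None:
    """Record that `dependent` is a dependent of `head`."""
    dependent.parent = head
    head.children.append(dependent)


def resolve(path: str, stack: List[Token], queue: List[Token]) -> Optional[Token]:
    """Return the token denoted by `path`, or None if it is not present."""
    nested = re.fullmatch(r"(\w+)\((.+)\)", path)
    if nested is None:
        index = int(path)
        if index < 0:
            # -1 is the top of the stack, stored here as the last list item.
            return stack[index] if -index <= len(stack) else None
        # 0 is the next token on the input queue.
        return queue[index] if index < len(queue) else None
    operator, inner = nested.groups()
    token = resolve(inner, stack, queue)
    if token is None:
        return None
    if operator == "parent":
        return token.parent
    if operator == "leftChild":
        return token.children[0] if token.children else None
    if operator == "rightChild":
        return token.children[-1] if token.children else None
    raise ValueError(f"operator not covered by this sketch: {operator}")


# Toy state: "the" is already attached to "cat", which is on top of the stack.
the, cat, sat, down = (Token(w) for w in ["the", "cat", "sat", "down"])
attach(cat, the)
stack, queue = [cat], [sat, down]

print(resolve("leftChild(-1)", stack, queue).form)  # the
print(resolve("1", stack, queue).form)              # down
print(resolve("rightChild(0)", stack, queue))       # None
```

Returning `None` for a missing token mirrors the "if present" wording in the table: a nested expression yields nothing as soon as any step of the path has no token.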
Features
Elementary token features are obtained from attributes of tokens. For example:
FORM(rightChild(0))
is the feature obtained from the attribute FORM of the rightmost dependent of the next token on the input.
Composite features are the concatenation of elementary token features. For example:
LEMMA(-1) POSTAG(0) DEPREL(leftChild(-1))
is the concatenation of the LEMMA of the top of the stack, the POSTAG of the next input token, and the DEPREL of the leftmost dependent of the top of the stack.
In the configuration file, features are expressed one per line, starting with Feature. For example:
Feature LEMMA(-1) POSTAG(0) DEPREL(leftChild(-1))
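As an illustration of the line format, the following Python sketch splits a `Feature` line into (attribute, path expression) pairs. The function name `parse_feature_line` is hypothetical, and the sketch assumes that the elementary features on a line are separated by whitespace and that path expressions contain no spaces, as in the example above; it does not reproduce DeSR's own configuration parser.

```python
import re
from typing import List, Tuple


def parse_feature_line(line: str) -> List[Tuple[str, str]]:
    """Split a 'Feature ...' line into (attribute, path expression) pairs."""
    keyword, *parts = line.split()
    if keyword != "Feature":
        raise ValueError(f"not a Feature line: {line!r}")
    pairs = []
    for part in parts:
        # ATTRIBUTE(path), where the path may itself contain parentheses.
        match = re.fullmatch(r"(\w+)\((.+)\)", part)
        if match is None:
            raise ValueError(f"malformed elementary feature: {part!r}")
        pairs.append((match.group(1), match.group(2)))
    return pairs


example = parse_feature_line("Feature LEMMA(-1) POSTAG(0) DEPREL(leftChild(-1))")
print(example)
# [('LEMMA', '-1'), ('POSTAG', '0'), ('DEPREL', 'leftChild(-1)')]
```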
As a shorthand it is possible to group elementary features using the same attribute as follows:
Features POSTAG -1 0 1 rightChild(0) rightChild(rightChild(0))
which represents five features made from the attribute POSTAG extracted from the tokens denoted by the five path expressions.
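The shorthand can be read as an abbreviation for one single-attribute `Feature` line per path expression. The Python sketch below performs that expansion; `expand_features_line` is a hypothetical helper written for this page, and the equivalence with separate `Feature` lines is inferred from the description above, not taken from DeSR's code.

```python
from typing import List


def expand_features_line(line: str) -> List[str]:
    """Expand 'Features ATTR path1 path2 ...' into one Feature line per path."""
    keyword, attribute, *paths = line.split()
    if keyword != "Features":
        raise ValueError(f"not a Features line: {line!r}")
    return [f"Feature {attribute}({path})" for path in paths]


expanded = expand_features_line(
    "Features POSTAG -1 0 1 rightChild(0) rightChild(rightChild(0))"
)
for feature_line in expanded:
    print(feature_line)
# Feature POSTAG(-1)
# Feature POSTAG(0)
# Feature POSTAG(1)
# Feature POSTAG(rightChild(0))
# Feature POSTAG(rightChild(rightChild(0)))
```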
Global features are extracted from the global state of the parser.
| Name|| Type|| Description|
| LexChildNonWord|| boolean|| Notice children containing non-ASCII letters|
| UseChildPunct|| boolean|| Notice punctuation in children of focus words|
| StackSize|| boolean|| Record whether the stack has more than one item|
| InputSize|| boolean|| Record whether the input has more than one item|
| InPunct|| boolean|| Whether the number of punctuation marks is even or odd|
| VerbCount|| boolean|| Enable count of preceding verbs|
| PastActions|| int|| Latest actions|
| WordDistance|| boolean|| Distance between top stack token and next input token|
| PunctCount|| boolean|| Enable count of preceding punctuation marks|
| MorphoSplit|| boolean|| Extract individual morphological traits|
| MorphoAgreement|| boolean|| Check morphological agreement|
| MorphExtract|| boolean|| Extract morph items (gender, number, case) from morphology|
| PrepChildEntityType|| boolean|| Note entity type (time or location) for children of prepositions|
The generated features can be controlled with additional options.
| Name|| Type|| Description|
| FeatureCutoff|| int|| Drop features occurring fewer than this many times|
| LexCutoff|| int|| Collapse to Unknown the forms or lemmas occurring fewer than LexCutoff times|
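The effect of the two cutoffs can be sketched in Python as frequency filters over the training data. This is an illustration under assumptions, not DeSR's implementation: the function names are hypothetical, counts are taken over the whole training set, and the literal placeholder string `"Unknown"` is assumed from the table's wording.

```python
from collections import Counter
from typing import List


def apply_feature_cutoff(examples: List[List[str]], cutoff: int) -> List[List[str]]:
    """Drop features that occur fewer than `cutoff` times across all examples."""
    counts = Counter(feature for features in examples for feature in features)
    return [[f for f in features if counts[f] >= cutoff] for features in examples]


def apply_lex_cutoff(words: List[str], cutoff: int, unknown: str = "Unknown") -> List[str]:
    """Replace forms or lemmas seen fewer than `cutoff` times with a placeholder."""
    counts = Counter(words)
    return [word if counts[word] >= cutoff else unknown for word in words]


examples = [["POSTAG(0)=NN", "LEMMA(-1)=cat"], ["POSTAG(0)=NN", "LEMMA(-1)=dog"]]
filtered = apply_feature_cutoff(examples, cutoff=2)
print(filtered)   # [['POSTAG(0)=NN'], ['POSTAG(0)=NN']]

collapsed = apply_lex_cutoff(["the", "cat", "the", "dog"], cutoff=2)
print(collapsed)  # ['the', 'Unknown', 'the', 'Unknown']
```

Both filters shrink the feature space: rare features and rare words give the classifier little evidence to learn from, so they are dropped or merged into a single placeholder.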
Attributes of tokens can be transformed before parsing by applying regular expression substitution.
This is currently available only for lemmas. For example, the following rule normalizes real numbers to 0.0:
LemmaReplace [0-9]+[0-9,]*.[0-9]* 0.0
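The substitution can be tried out with Python's `re` module, as in the sketch below. Two assumptions apply: DeSR's regular expression dialect may differ from Python's, and the helper `replace_lemma` is a hypothetical name. In the common convention an unescaped `.` matches any character, so the pattern as written also matches plain integers; the variant with an escaped dot is shown only for comparison and is not taken from the DeSR documentation.

```python
import re

# Pattern exactly as written in the rule above.
AS_WRITTEN = r"[0-9]+[0-9,]*.[0-9]*"
# Variant with the dot escaped; an assumption, shown for comparison only.
ESCAPED = r"[0-9]+[0-9,]*\.[0-9]*"


def replace_lemma(lemma: str, pattern: str, replacement: str = "0.0") -> str:
    """Replace every match of `pattern` in `lemma` with `replacement`."""
    return re.sub(pattern, replacement, lemma)


print(replace_lemma("3.14", AS_WRITTEN))     # 0.0
print(replace_lemma("1,234.5", AS_WRITTEN))  # 0.0
print(replace_lemma("42", AS_WRITTEN))       # 0.0
print(replace_lemma("42", ESCAPED))          # 42
```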