OpenMP can now be exploited for training the SVM classifier using multithreading.
If you have multiple cores and your compiler supports OpenMP, you can exploit parallelism by setting the following environment variable to the number of cores you want to use:
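With standard OpenMP, the thread count is controlled by the OMP_NUM_THREADS environment variable. For example, to train on 4 cores (the training invocation shown in the comment is only illustrative, not the exact DeSR command line):

```shell
# Standard OpenMP: limit the runtime to 4 threads
export OMP_NUM_THREADS=4
# then run SVM training as usual, e.g. (illustrative):
# desr -t -m svm.model train.conll
```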
A trained model for Spanish has been added among the Pre-trained Models. The model has been trained on the Spanish AnCora-ES treebank developed by the AnCora project:
M. Antonia Martí, Mariona Taulé, Lluís Márquez, Manuel Bertran (2007). "AnCora: A Multilingual and Multilevel Annotated Corpus". http://clic.ub.edu/ancora/publications/
Thanks to Mihai Surdeanu, I had the opportunity to test DeSR on the WSJ Penn Treebank, annotated with the Stanford Dependencies.
I used the version of the corpus annotated with the Stanford basic dependency representation, which is in tree format.
There is also another style of representation, called collapsed dependencies, in which certain patterns are collapsed: for example, prepositions are removed and incorporated into the label of the collapsed dependency, as in "prep_in".
These could be obtained by transforming the basic trees.
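As a rough sketch of the kind of transformation involved, the "prep_in" case can be obtained by rewriting head → prep → pobj chains in the basic trees (the data representation below is illustrative; the actual Stanford collapsing rules cover many more patterns):

```python
def collapse_preps(deps):
    """Collapse preposition patterns in basic Stanford dependencies.

    deps: list of (head, label, dependent) triples, where tokens are
    (index, word) pairs.  A chain  head --prep--> P --pobj--> X  becomes
    a single arc  head --prep_P--> X, dropping the preposition's arcs.
    """
    # Map each preposition token to its governor.
    prep_arcs = {d: h for (h, lbl, d) in deps if lbl == "prep"}
    collapsed = []
    for (h, lbl, d) in deps:
        if lbl == "prep":
            continue  # drop the head -> preposition arc
        if lbl == "pobj" and h in prep_arcs:
            gov = prep_arcs[h]
            # h is the preposition token; fold its word into the label
            collapsed.append((gov, "prep_" + h[1].lower(), d))
        else:
            collapsed.append((h, lbl, d))
    return collapsed
```

For instance, "lived --prep--> in --pobj--> Rome" collapses to a single "prep_in" arc from "lived" to "Rome".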
My usual parser combination achieved the following scores on the development test set:
Labeled attachment score: 41517 / 46451 * 100 = 89.38 %
Unlabeled attachment score: 42352 / 46451 * 100 = 91.18 %
Label accuracy score: 43163 / 46451 * 100 = 92.92 %
These results can be compared with those obtained by Mihai using an ensemble of MaltParsers.
In several Domain Adaptation experiments, we used the likelihood that the parser estimates for each transition as an indication of perplexity: sentences with high perplexity are good candidates for Active Learning.
See for example this paper:
J. Atserias, G. Attardi, M. Simi, H. Zaragoza. Active Learning for Building a Corpus of Questions for Parsing. Proc. of LREC 2010, Malta, 2010.
Since this feature might be useful for other purposes as well, I have added a command line option -p to enable it.
When this option is activated, the parser prints a tag in front of each sentence:
<LogLikelihood all=-23.6232 avg=0.111773 min=5.50254e-11 />
all is the overall log-likelihood of the parse, avg is the average over all the transitions, and min is the lowest likelihood among all transitions.
The option only has effect when using the MLP or ME classifiers, which provide probability distributions over their predictions.
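Assuming that all is the sum of the log-probabilities of the transitions, avg the mean transition probability, and min the smallest one (an interpretation of the tag above, not the actual DeSR source), these statistics could be computed as:

```python
import math

def loglikelihood_stats(transition_probs):
    """Summarize the per-transition probabilities of one parse.

    Returns (overall log-likelihood, average probability, minimum
    probability), mirroring the all/avg/min attributes of the tag.
    """
    all_ll = sum(math.log(p) for p in transition_probs)
    avg = sum(transition_probs) / len(transition_probs)
    return all_ll, avg, min(transition_probs)
```

Sentences whose parses have a low average or a very low minimum transition probability are the high-perplexity candidates for Active Learning.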
Using feature models involving the new composite features, I was able to improve the parser accuracy on the English Penn TreeBank.
I have then used parser combination, according to the technique presented in our NAACL 2009 paper: G. Attardi, F. Dell'Orletta. Reverse Revision and Linear Tree Combination for Dependency Parsing. Proc. of NAACL HLT 2009, 2009.
The combination of three parsers, a standard MLP model and two reverse revision MLP models, achieves the following scores:
Labeled attachment score: 51497 / 57676 * 100 = 89.29 %
Unlabeled attachment score: 52827 / 57676 * 100 = 91.59 %
Label accuracy score: 53910 / 57676 * 100 = 93.47 %
which are among the very best reported at the CoNLL 2008 Shared Task.
This is an excellent result considering that the DeSR parser combination is still a fast linear process, while the best system at CoNLL 2008 was a second-order MST parser which adopted the expensive search procedure by Carreras (2007).
Composite features can now be used, extracted from several tokens and combining different attributes.
So far, features were extracted from a single token and were elementary.
For example, the following notation:
Features LEMMA -1 0 1 leftChild(0) rightChild(prev(0))
meant: use as features the lemmas of the tokens determined respectively as:
-1: first on the stack
0: next in the input queue
1: second in the input queue
leftChild(0): left child of the next input token
rightChild(prev(0)): right child of the token immediately preceding the next input token
Now it is possible to denote composite features, extracted from different tokens in the following way:
Feature LEMMA(-1) POSTAG(0) DEPREL(leftChild(-1))
The feature is the concatenation of three elementary features extracted from tokens denoted in the same way as before.
The old notation is still available for backward compatibility.
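The concatenation of elementary features could be sketched as follows (the token representation, attribute names, and position-resolution function are illustrative, not the actual DeSR implementation; functional positions like leftChild are omitted here):

```python
def resolve(pos, stack, queue):
    """Map a position index to a token: negative positions index the
    stack (-1 = top), non-negative ones the input queue (0 = next)."""
    return stack[pos] if pos < 0 else queue[pos]

def composite_feature(spec, stack, queue):
    """Build one composite feature from (attribute, position) pairs by
    concatenating the elementary features, e.g.
    [("LEMMA", -1), ("POSTAG", 0)] -> "LEMMA(-1)=saw|POSTAG(0)=NN".
    Tokens are dicts mapping attribute names to values."""
    parts = []
    for attr, pos in spec:
        token = resolve(pos, stack, queue)
        parts.append("%s(%d)=%s" % (attr, pos, token[attr]))
    return "|".join(parts)
```

The composite feature is thus a single string seen by the classifier, letting it condition on conjunctions of attributes such as the CPOSTAG pairs used below.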
Adding the following composite features to the English model:
Feature CPOSTAG(-1) CPOSTAG(0)
Feature CPOSTAG(0) CPOSTAG(1)
Feature CPOSTAG(-1) CPOSTAG(1)
improved the scores to:
Labeled attachment score: 50685 / 57676 * 100 = 87.88 %
Unlabeled attachment score: 52066 / 57676 * 100 = 90.27 %
on the English Penn TreeBank.
With parser combination DeSR achieves:
Labeled attachment score: 51277 / 57676 * 100 = 88.91 %
Unlabeled attachment score: 52612 / 57676 * 100 = 91.22 %
DeSR participated in the ICON 2010 contest on parsing Indian languages.
DeSR ranked first in UAS on all languages, with an average of 90.94.
On LAS, DeSR is first with 88.98 on Hindi-Coarse (where I tuned it most) and second on Hindi-Fine with 87.49.
Bangla and Telugu have very small training sets (7,000 tokens), and nonetheless DeSR is second or third, within about 1% of the best.
A new release is available for download.
Most notable changes:
Release 1.2.2 is available for download.
It includes some files that were missing from Boost and now allows compiling without Python support.
A model for French has also been made available.
DeSR has been trained on the French Treebank, provided by Marie Candito and described in a forthcoming LREC 2010 paper entitled "Statistical French dependency parsing: treebank conversion and first results".
DeSR achieves the following accuracy scores on a split with 1,235 sentences for test, 1,235 for development, and the rest for training:
Labeled attachment score: 27751 / 31404 * 100 = 88.37 %
Unlabeled attachment score: 28506 / 31404 * 100 = 90.77 %
Label accuracy score: 29205 / 31404 * 100 = 93.00 %