posted May 20, 2012 3:43 PM by Giuseppe Attardi
Thanks to Mihai Surdeanu, I had the opportunity to test DeSR on the WSJ Penn Treebank annotated with the Stanford Dependencies. I used the version of the corpus annotated with the Stanford basic dependency representation, in which each sentence forms a tree.
There is also another style of representation, called collapsed dependencies, which collapses certain patterns: for example, a preposition is removed and folded into the label of the collapsed dependency, as in "prep_in". These could be obtained by transforming the basic trees.
My usual parser combination achieved the following scores on the development test set:
Labeled attachment score: 41517 / 46451 * 100 = 89.38 %
Unlabeled attachment score: 42352 / 46451 * 100 = 91.18 %
Label accuracy score: 43163 / 46451 * 100 = 92.92 %
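For readers unfamiliar with these three measures, here is a minimal Python sketch of how they are usually computed from per-token (head, label) pairs. The function names and the toy data are hypothetical and are not part of DeSR or of any evaluation script; which tokens are scored (for instance, whether punctuation counts) is outside this sketch.

```python
def percentage(correct, total):
    """Score as printed in the posts: correct / total * 100, two decimals."""
    return round(100.0 * correct / total, 2)


def attachment_scores(gold, predicted):
    """Return LAS, UAS and label accuracy as percentages.

    gold and predicted are equal-length lists of (head, label) tuples,
    one tuple per scored token (hypothetical input format).
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    pairs = list(zip(gold, predicted))
    total = len(pairs)
    las = sum(1 for g, p in pairs if g == p)        # head and label both correct
    uas = sum(1 for g, p in pairs if g[0] == p[0])  # head correct
    la = sum(1 for g, p in pairs if g[1] == p[1])   # label correct
    return {
        "LAS": percentage(las, total),
        "UAS": percentage(uas, total),
        "LA": percentage(la, total),
    }


# Toy example: four tokens, one wrong label and one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (3, "prep")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "prep")]
scores = attachment_scores(gold, pred)
```

A token counts towards the labeled score only when both its head and its label are right, which is why that score can never exceed either of the other two.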
|
posted Apr 26, 2012 1:57 AM by Giuseppe Attardi
In several experiments on Domain Adaptation, we used the estimated likelihood that the parser computes for each transition as an indication of perplexity. Sentences with high perplexity are good candidates for Active Learning. See for example this paper:
J. Atserias, G. Attardi, M. Simi, H. Zaragoza. Active Learning for Building a Corpus of Questions for Parsing. Proc. of LREC 2010, Malta, 2010.
Since this feature might be useful for other purposes as well, I have added a command line option -p to enable it.
When this option is active, the parser prints a tag in front of each sentence:
<LogLikelihood all=-23.6232 avg=0.111773 min=5.50254e-11 />
Here all is the overall likelihood of the parse, avg is the average over all the transitions, and min is the lowest value among the transitions.
The option only has an effect when using the MLP or ME classifiers, which provide probability distributions for their predictions.
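As a rough illustration of how this output might be consumed, the following Python sketch parses the tag and orders sentences so that the least confident ones come first. It assumes the tag appears exactly in the form shown above. The helper names, and the choice of min as the ranking key, are hypothetical and are not something the parser prescribes.

```python
import re

# Matches the tag in the form shown in the post; attribute order is assumed fixed.
TAG_RE = re.compile(r"<LogLikelihood\s+all=(\S+)\s+avg=(\S+)\s+min=(\S+)\s*/>")


def parse_tag(line):
    """Return the three values as floats, or None if the line holds no tag."""
    match = TAG_RE.search(line)
    if match is None:
        return None
    overall, average, lowest = (float(group) for group in match.groups())
    return {"all": overall, "avg": average, "min": lowest}


def rank_by_lowest_transition(tag_lines):
    """Return sentence indices ordered by ascending min value.

    Assumption: a low min signals an uncertain parse, so such sentences
    are listed first as candidates for manual annotation.
    """
    scored = []
    for index, line in enumerate(tag_lines):
        tag = parse_tag(line)
        if tag is not None:
            scored.append((tag["min"], index))
    scored.sort()
    return [index for _, index in scored]


example = parse_tag("<LogLikelihood all=-23.6232 avg=0.111773 min=5.50254e-11 />")
order = rank_by_lowest_transition([
    "<LogLikelihood all=-1.5 avg=0.9 min=0.4 />",
    "<LogLikelihood all=-23.6232 avg=0.111773 min=5.50254e-11 />",
])
```

Ranking on avg or all instead only requires changing the key stored in the tuple.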
|
posted Feb 14, 2012 3:57 AM by Giuseppe Attardi
[ updated Feb 14, 2012 6:36 AM ]
Using feature models involving the new composite features, I was able to improve the parser accuracy on the English Penn TreeBank. I then used parser combination, according to the technique presented in our NAACL 2009 paper:
G. Attardi, F. Dell'Orletta. Reverse Revision and Linear Tree Combination for Dependency Parsing. Proc. of NAACL HLT 2009, 2009.
The combination of three parsers, a standard MLP model and two reverse revision MLP models, achieves the following scores:
Labeled attachment score: 51497 / 57676 * 100 = 89.29 %
Unlabeled attachment score: 52827 / 57676 * 100 = 91.59 %
Label accuracy score: 53910 / 57676 * 100 = 93.47 %
which are among the very best of those reported at the CoNLL 2008 Shared Task. This is an excellent result considering that the DeSR parser combination is still a fast linear process, while the best at CoNLL 2008 was a 2nd-order MST parser which adopted the expensive search procedure by Carreras (2007).
|
posted Feb 11, 2012 1:14 AM by Giuseppe Attardi
[ updated Feb 11, 2012 2:07 AM ]
Composite features can now be used, extracted from several tokens and using different attributes. So far, features were extracted from a single token and were elementary. For example, the following notation:
Features LEMMA -1 0 1 leftChild(0) rightChild(prev(0))
meant to use as features the lemmas of the tokens determined respectively as:
-1                   first on the stack
0                    next in input queue
1                    second in input
leftChild(0)         left child of next
rightChild(prev(0))  right child of token immediately preceding the next in input
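The token addresses described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not DeSR's implementation: the dictionary-based parser state, the helper names, the "<none>" placeholder and the "|" separator are all hypothetical, "first on the stack" is read as the top of the stack, and leftChild/rightChild are read as the leftmost and rightmost dependents.

```python
NONE = "<none>"  # placeholder for a missing token (hypothetical choice)


def token_at(state, offset):
    """Negative offsets address the stack (-1 = top), others the input queue."""
    if offset < 0:
        stack = state["stack"]
        return stack[offset] if -offset <= len(stack) else None
    queue = state["queue"]
    return queue[offset] if offset < len(queue) else None


def left_child(state, pos):
    """Leftmost dependent preceding pos, or None (assumed reading of leftChild)."""
    if pos is None:
        return None
    kids = [c for c in state["children"].get(pos, []) if c < pos]
    return min(kids) if kids else None


def right_child(state, pos):
    """Rightmost dependent following pos, or None (assumed reading of rightChild)."""
    if pos is None:
        return None
    kids = [c for c in state["children"].get(pos, []) if c > pos]
    return max(kids) if kids else None


def prev(state, pos):
    """Token immediately preceding pos in the sentence, or None."""
    return pos - 1 if pos is not None and pos > 0 else None


def attribute(state, name, pos):
    """Value of an attribute such as LEMMA at a position, or the placeholder."""
    return NONE if pos is None else state["tokens"][pos].get(name, NONE)


def elementary_lemmas(state):
    """Features LEMMA -1 0 1 leftChild(0) rightChild(prev(0))"""
    nxt = token_at(state, 0)
    positions = [
        token_at(state, -1),
        nxt,
        token_at(state, 1),
        left_child(state, nxt),
        right_child(state, prev(state, nxt)),
    ]
    return [attribute(state, "LEMMA", pos) for pos in positions]


def composite(state, parts, sep="|"):
    """Concatenate elementary features given as (attribute name, position) pairs."""
    return sep.join(attribute(state, name, pos) for name, pos in parts)


# Toy state for "the cat eats fish": "the" is already attached to "cat".
state = {
    "tokens": [
        {"LEMMA": "the", "POSTAG": "DT", "DEPREL": "det"},
        {"LEMMA": "cat", "POSTAG": "NN"},
        {"LEMMA": "eat", "POSTAG": "VBZ"},
        {"LEMMA": "fish", "POSTAG": "NN"},
    ],
    "stack": [1],          # top of stack is the last element
    "queue": [2, 3],       # next input token first
    "children": {1: [0]},  # head position -> dependent positions
}
lemmas = elementary_lemmas(state)
top = token_at(state, -1)
# Feature LEMMA(-1) POSTAG(0) DEPREL(leftChild(-1))
combined = composite(state, [
    ("LEMMA", top),
    ("POSTAG", token_at(state, 0)),
    ("DEPREL", left_child(state, top)),
])
```

In this sketch a composite feature is a single string, so the classifier sees the combination of values as one feature rather than three independent ones.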
Now it is possible to denote composite features, extracted from different tokens, in the following way:
Feature LEMMA(-1) POSTAG(0) DEPREL(leftChild(-1))
The feature is the concatenation of three elementary features extracted from tokens denoted in the same way as before. The old notation is still available for backward compatibility.
Adding the following composite features to the English model:
Feature CPOSTAG(-1) CPOSTAG(0)
Feature CPOSTAG(0) CPOSTAG(1)
Feature CPOSTAG(-1) CPOSTAG(1)
provided an improvement to
Labeled attachment score: 50685 / 57676 * 100 = 87.88 %
Unlabeled attachment score: 52066 / 57676 * 100 = 90.27 %
on the English Penn TreeBank. With parser combination DeSR achieves:
Labeled attachment score: 51277 / 57676 * 100 = 88.91 %
Unlabeled attachment score: 52612 / 57676 * 100 = 91.22 %
|
posted Oct 18, 2010 9:53 AM by Giuseppe Attardi
DeSR is first in UAS on all languages with an average of 90.94.
On LAS, DeSR is first on Hindi-Coarse with 88.98 (where I tuned it most) and second on Hindi-Fine with 87.49.
Bangla and Telugu have very small training sets (7000 tokens); nonetheless DeSR is second or third there, at a distance of about 1%.
|
posted Aug 20, 2010 4:23 AM by Giuseppe Attardi
[ updated Aug 20, 2010 4:30 AM ]
A new release is available for download.
Most notable changes:
- SWIG wrapper for Java
- configurable option for lemma normalization based on regular expressions
- fixes to memory management for Context in tokens
|
posted May 15, 2010 4:49 AM by Giuseppe Attardi
Release 1.2.2 is available for download.
It includes some files from Boost that were missing and can now also be compiled without Python support.
A model for French has also been made available.
|
posted Apr 5, 2010 1:40 AM by Giuseppe Attardi
DeSR has been trained on the French Treebank, provided by Marie Candito and described in a forthcoming LREC2010 paper entitled: "Statistical French dependency parsing: treebank conversion and first results".
DeSR achieves the following accuracy scores on a split of 1235 sentences for test, 1235 for development and the rest for training:
Labeled attachment score: 27751 / 31404 * 100 = 88.37 %
Unlabeled attachment score: 28506 / 31404 * 100 = 90.77 %
Label accuracy score: 29205 / 31404 * 100 = 93.00 %
|
posted Jan 25, 2010 9:34 AM by Giuseppe Attardi
Release 1.2.1, available on Sourceforge, has been updated so that it also compiles with Visual C++ 2008 Express.
|
posted Jan 25, 2010 9:33 AM by Giuseppe Attardi
|