(Released with the COLING 2014 paper: What good are 'Nominalkomposita' for 'noun compounds':
Multilingual Extraction and Structure Analysis of Nominal Compositions using Linguistic Restrictors)
This database contains automatically extracted English noun compounds and their translations in up to ten languages, drawn from the OPUS Europarl resource.
The languages covered (English plus the ten translation languages):
Danish
Dutch
English
French
German
Greek
Italian
Portuguese
Romanian
Spanish
Swedish
PropBank semantic role annotations on French and English sections of Europarl.
(Package S2.1, data released by the FP7 CLASSiC Project)
README for this package
There are three packages that provide syntactic and semantic annotations for the Europarl corpus (Koehn, 2005).
The Europarl parallel corpus is extracted from the proceedings of the European Parliament. The format used is
the CoNLL09 format described in http://ufal.mff.cuni.cz/conll2009-st/task-description.html (see 'Data format').
1) Package S2.1-1: "The EuroParl parallel corpus: Hand-annotated French data" (1MB txt file) contains 1000
French sentences manually annotated using the annotation scheme of PropBank. The syntactic annotations are
the output of a parser (Titov and Henderson, 2007) trained on the dependency conversion of the French
Treebank (Candito et al., 2009).
2) Package S2.1-2: "The EuroParl parallel corpus: Parsed English Data" (114MB gz file) contains 983K English
sentences from the Europarl corpus and their syntactic-semantic analysis as provided by the parser
(Henderson et al., 2008; Titov et al., 2009), which was trained on the Penn Treebank corpus merged
with PropBank and NomBank labels.
3) Package S2.1-3: "The EuroParl parallel corpus: Parsed French Data" (110MB gz file) contains 983K French
sentences from the Europarl corpus and their syntactic-semantic analysis, which results from our work on
automatic cross-lingual semantic role annotation (Van der Plas et al., 2011).
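
Example: reading the CoNLL09 format
As a rough illustration of the data format named above, the following Python sketch reads CoNLL09-style
sentences and lists each predicate with its labelled arguments. It is not part of the packages. The column
names follow the CoNLL-2009 task description; the function names (read_sentences, predicate_arguments)
and the SAMPLE sentence are made up for illustration, and the sketch assumes one token per line, a blank
line between sentences, and "_" for empty fields.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Fixed columns of the CoNLL-2009 format. One APRED column per predicate
# in the sentence follows after PRED.
COLUMNS = ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
           "HEAD", "PHEAD", "DEPREL", "PDEPREL", "FILLPRED", "PRED"]

# Invented example sentence, NOT taken from the packages. The official format
# is tab-separated; str.split() below accepts tabs and spaces alike.
SAMPLE = """\
1 The        the        the        DT  DT  _ _ 2 2 NMOD NMOD _ _        _
2 Parliament parliament parliament NNP NNP _ _ 3 3 SBJ  SBJ  _ _        A0
3 adopted    adopt      adopt      VBD VBD _ _ 0 0 ROOT ROOT Y adopt.01 _
4 it         it         it         PRP PRP _ _ 3 3 OBJ  OBJ  _ _        A1
"""


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict]]:
    """Yield sentences as lists of token dicts; blank lines end a sentence."""
    sentence: List[Dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split()
        token: Dict = dict(zip(COLUMNS, fields))
        token["APREDS"] = fields[len(COLUMNS):]  # argument labels, one per predicate
        sentence.append(token)
    if sentence:  # last sentence when the input has no trailing blank line
        yield sentence


def predicate_arguments(sentence: List[Dict]) -> List[Tuple[str, List[Tuple[str, str]]]]:
    """Return (predicate sense, [(role, word form), ...]) for each predicate."""
    predicates = [t for t in sentence if t.get("PRED", "_") != "_"]
    result = []
    for i, pred in enumerate(predicates):
        # The i-th APRED column holds the arguments of the i-th predicate.
        args = [(t["APREDS"][i], t["FORM"]) for t in sentence
                if i < len(t["APREDS"]) and t["APREDS"][i] != "_"]
        result.append((pred["PRED"], args))
    return result
```

To try this on the gz packages, the lines could be supplied by gzip.open(path, "rt", encoding="utf-8")
from the Python standard library, where the path and the UTF-8 encoding are assumptions to check against
the downloaded files.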
---- References:
M.-H. Candito, B. Crabbé, P. Denis, and F. Guérin. 2009. Analyse syntaxique du français : des constituants
aux dépendances. In Proceedings of TALN, Senlis, France.
J. Henderson, P. Merlo, G. Musillo, and I. Titov. 2008. A latent variable model of synchronous parsing for
syntactic and semantic dependencies. In Proceedings of CONLL 2008, Manchester, UK.
P. Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of
the MT Summit 2005, Phuket, Thailand.
L. van der Plas, P. Merlo, and J. Henderson. 2011. Scaling up Cross-Lingual Semantic Annotation Transfer.
In Proceedings of ACL/HLT, Portland, US.
I. Titov and J. Henderson. 2007. A latent variable model for generative dependency parsing. In Proceedings of
the International Conference on Parsing Technologies (IWPT-07), Prague, Czech Republic.
I. Titov, J. Henderson, P. Merlo, and G. Musillo. 2009. Online graph planarisation for synchronous parsing of
semantic and syntactic dependencies. In Proceedings of the twenty-first international joint conference on
artificial intelligence (IJCAI-09), Pasadena, California.
---------------- END OF README FILE ----------------------------------