Linked WordNets for Ancient Indo-European languages

Related projects

The project Linked WordNets for Ancient Indo-European Languages builds upon and continues a previous project, led by Silvia Luraghi and William Michael Short, that aimed to enrich and harmonize two existing WordNets for Ancient Greek and Sanskrit. The project resulted from an institutional collaboration agreement between the Universities of Pavia and Exeter.

Another related project is LiLa: Linking Latin (Università Cattolica del Sacro Cuore, PI: Marco Passarotti), under which the Latin WordNet has been extensively cleaned up and expanded.

The Ancient Greek WordNet

The Ancient Greek WordNet was first created in 2014 as a collaboration among the Institute of Computational Linguistics “Antonio Zampolli” in Pisa, the Perseus Project in Boston, the Open Philology Project in Leipzig, and the Alpheios Project in New York, drawing on a previous collaboration with the University of Pavia (Sausa 2012).

The initial automatic construction of the Ancient Greek WordNet was achieved using digitized Greek-English lexica provided by the Perseus Project, especially the Middle-Liddell (Liddell and Scott 1889). The Greek word of each extracted bilingual pair was linked to every synset in the Princeton WordNet in which the English member of the pair appeared (Bizzoni et al. 2014: 1141). Furthermore, the Ancient Greek WordNet synsets are aligned to the Italian section of MultiWordNet (MWN), to another Italian WordNet, ItalWordNet (IWN) (Roventini et al. 2003), developed at the Institute for Computational Linguistics “A. Zampolli” in Pisa, and to the Latin WordNet.
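
The linking step can be illustrated with a minimal sketch. This is not the code used by Bizzoni et al.: the toy synset inventory, the synset identifiers, and the function name link_greek_to_synsets are all assumptions made for the example.

```python
from collections import defaultdict

# Toy stand-in for the Princeton WordNet: synset id -> English member lemmas.
# The ids and their contents are invented for illustration only.
TOY_SYNSETS = {
    "n#horse-animal": {"horse", "equus caballus"},
    "n#horse-cavalry": {"horse", "cavalry"},
    "n#river": {"river"},
}

# Bilingual pairs as they might be extracted from a Greek-English lexicon.
BILINGUAL_PAIRS = [("ἵππος", "horse"), ("ποταμός", "river")]


def link_greek_to_synsets(pairs, synsets):
    """Link each Greek lemma to every synset that contains its English gloss."""
    # Invert the inventory: English lemma -> ids of the synsets it belongs to.
    index = defaultdict(set)
    for synset_id, members in synsets.items():
        for english in members:
            index[english].add(synset_id)

    # Attach every matching synset to the Greek member of the pair.
    links = defaultdict(set)
    for greek, english in pairs:
        links[greek] |= index.get(english, set())
    return dict(links)


links = link_greek_to_synsets(BILINGUAL_PAIRS, TOY_SYNSETS)
```

In this toy data the English gloss "horse" belongs to two synsets, so the Greek lemma inherits both senses. Over-generation of this kind is why the automatically extracted links need to be evaluated and filtered afterwards.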

The automatic extraction was evaluated on a relevant sample of synsets, and erroneous matches were eliminated by identifying and filtering out anachronistic domains, such as “aviation”.
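
A domain filter of this kind might look like the following sketch. Only the domain “aviation” comes from the text; the synset identifiers, the other domain label, and the function name filter_anachronistic are hypothetical.

```python
# Domains considered anachronistic for an ancient language.
# Only "aviation" is named in the source; extend the set as needed.
ANACHRONISTIC_DOMAINS = {"aviation"}


def filter_anachronistic(links, synset_domains, banned=ANACHRONISTIC_DOMAINS):
    """Drop candidate links whose synset is labelled with a banned domain."""
    return {
        lemma: {s for s in synset_ids if synset_domains.get(s) not in banned}
        for lemma, synset_ids in links.items()
    }


# Hypothetical candidate links and domain labels.
candidate_links = {"πτερόν": {"n#wing-bird", "n#wing-aircraft"}}
synset_domains = {"n#wing-bird": "zoology", "n#wing-aircraft": "aviation"}

filtered = filter_anachronistic(candidate_links, synset_domains)
```

Synsets without a domain label are kept in this sketch, on the assumption that a missing label is not evidence of anachronism.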

At the time of the publication of Bizzoni et al. (2014), the Ancient Greek WordNet consisted of 35k different lemmas, with a coverage of 28% with respect to the total Greek lexicon, which contains approximately 120k lemmas.

References

Bizzoni, Yuri, Federico Boschetti, Harry Diakoff, Riccardo Del Gratta, Monica Monachini and Gregory R. Crane. 2014. The Making of Ancient Greek WordNet. In: Nicoletta Calzolari et al. (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC, vol. 2014), Reykjavik, Iceland, May 2014, 1140-1147. Accessed online at https://www.aclweb.org/anthology/L14-1054/.

Liddell, Henry George, and Robert Scott. 1889. An Intermediate Greek-English Lexicon. Clarendon Press, Oxford.

Roventini, Adriana et al. 2003. Italwordnet: Building a large semantic database for the automatic treatment of the Italian language. Computational Linguistics in Pisa, Special Issue of Linguistica Computazionale, 18.

Sausa, Eleonora. 2012. Toward an Ancient Greek wordnet. In Workshop on WordNet and SketchEngine.

The Sanskrit WordNet

The Sanskrit WordNet builds on, and extends, original work by Oliver Hellwig for the Digital Corpus of Sanskrit (Hellwig 2017; 2010-2021).

Instead of automatically extracting bilingual pairs from a Sanskrit-English dictionary and mapping them onto synsets in the Princeton WordNet, the Sanskrit WordNet was created through genuine annotation of an ontology. The original data were taken from a public release of the OpenCyc knowledge base (Lenat 1995), which contains concepts and knowledge about them in the form of relations such as kind_of, member_of, part_of, and instance_of. Based on their English descriptions, a substantial subset of the OpenCyc concepts was then mapped onto the English WordNet 2.1 and onto the forty-five WordNet lexicographer files, which contain supersenses based on syntactic category and logical groupings.
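
As a rough illustration of how such an ontology can be represented and queried, the sketch below stores relations as (subject, relation, object) triples and walks kind_of links upwards. The concept names and the function name ancestors are assumptions for the example and are not taken from OpenCyc.

```python
# Relations as (subject, relation, object) triples.
# Concept names are invented; relation names follow those listed in the text.
RELATIONS = [
    ("Horse", "kind_of", "Mammal"),
    ("Mammal", "kind_of", "Animal"),
    ("Tail", "part_of", "Horse"),
]


def ancestors(concept, relations, relation="kind_of"):
    """Return every concept reachable from `concept` by following `relation`."""
    # Build a lookup table: concept -> its direct parents under `relation`.
    parents = {}
    for subj, rel, obj in relations:
        if rel == relation:
            parents.setdefault(subj, set()).add(obj)

    # Depth-first walk up the hierarchy, guarding against cycles.
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


horse_ancestors = ancestors("Horse", RELATIONS)
```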

Appropriate concepts were used for manually annotating selected texts from the Digital Corpus of Sanskrit with word semantic information, especially alchemical literature (Hellwig 2009), epics (Hellwig 2016) and, more recently, Vedic literature. If concepts required for the annotation of a text were not found in the original OpenCyc inventory, new concepts along with a brief description were added to the ontology. In addition, anachronistic concepts were partly removed. In this method, synsets are implicitly defined by the Sanskrit words assigned to the same ontological concept.
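
The idea that synsets are implicit in the annotation can be sketched as a simple grouping operation. The annotation pairs and concept names below are hypothetical examples, not data from the Digital Corpus of Sanskrit.

```python
from collections import defaultdict

# Hypothetical (lemma, concept) pairs, one per annotated token.
annotations = [
    ("aśva", "Horse"),
    ("haya", "Horse"),
    ("turaga", "Horse"),
    ("nadī", "River"),
    ("aśva", "Horse"),  # the same lemma annotated again in another passage
]


def synsets_from_annotations(annotations):
    """Group lemmas by concept: each concept's lemma set acts as a synset."""
    synsets = defaultdict(set)
    for lemma, concept in annotations:
        synsets[concept].add(lemma)
    return dict(synsets)


synsets = synsets_from_annotations(annotations)
```

No synset has to be declared in advance under this approach: a new synset appears as soon as a word is annotated with a concept that no other word has yet received.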

In total, 600,609 tokens (32,227 lemmas) in the Digital Corpus of Sanskrit are provided with semantic annotation. The semantic network consists of 124,040 concepts and 194,092 relations. There are 24,401 Sanskrit-specific concepts; 50,595 concepts are mapped onto WordNet 2.1 and 78,198 onto the lexicographer files.

The semantic annotation performed by Hellwig has been directly imported from the Digital Corpus of Sanskrit into the new Sanskrit WordNet by William Short. The same is true for the morphological information, which has been integrated into each lemma of the Sanskrit WordNet. Finally, lexical relations such as composition, derivation, parasynthesis, and conversion have been automatically extracted from the XML version of the Monier-Williams Sanskrit-English dictionary, where lemmas are listed under the root from which they derive and marked for the type of morphological relation that holds between them.
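
An extraction of this kind could be sketched as follows. The XML markup is invented for the example and does not reproduce the actual schema of the Monier-Williams XML; the element names, attribute names, and the function name extract_relations are all assumptions.

```python
import xml.etree.ElementTree as ET

# Invented markup: lemmas nested under their root and marked for relation type.
# The real Monier-Williams XML uses a different, richer schema.
TOY_XML = """
<dictionary>
  <root lemma="gam">
    <entry lemma="gati" relation="derivation"/>
    <entry lemma="gamana" relation="derivation"/>
  </root>
</dictionary>
"""


def extract_relations(xml_text):
    """Return (lemma, relation, root) triples for every entry under a root."""
    document = ET.fromstring(xml_text.strip())
    triples = []
    for root_el in document.iter("root"):
        for entry in root_el.iter("entry"):
            triples.append(
                (entry.get("lemma"), entry.get("relation"), root_el.get("lemma"))
            )
    return triples


relations = extract_relations(TOY_XML)
```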

References

Hellwig, Oliver. 2009. Wörterbuch der mittelalterlichen indischen Alchemie. Eelde: Barkhuis.

Hellwig, Oliver. 2010-2021. The Digital Corpus of Sanskrit.

Hellwig, Oliver. 2016. A Computational Approach to the Text History of the Rāmāyaṇa. In: Ivan Andrijanić and Sven Sellmer (eds.), Proceedings of the DICSEP 2008. Zagreb: Croatian Academy of Sciences and Arts, 41–62.

Hellwig, Oliver. 2017. Coarse semantic classification of rare nouns using cross-lingual data and recurrent neural networks. In IWCS 2017: 12th International Conference on Computational Semantics, Long papers.

Lenat, Douglas B. 1995. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM 38.11: 33-38.

LiLa: Linking Latin

The LiLa project builds a Linked Data-based Knowledge Base of Linguistic Resources and Natural Language Processing (NLP) tools for Latin. The Knowledge Base consists of different kinds of objects connected via an explicitly-declared vocabulary for knowledge description.

LiLa collects and connects both existing and newly-generated (meta)data. The former are mostly linguistic resources (corpora, lexica, ontologies, dictionaries, thesauri) and NLP tools (tokenisers, lemmatisers, PoS-taggers, morphological analysers and dependency parsers) for Latin. These are currently available from different providers under different licences. With regard to newly-generated (meta)data, LiLa assesses a set of selected linguistic resources by expanding their lexical and/or textual coverage.

In particular, LiLa (a) enhances a large number of Latin texts with PoS-tagging and lemmatisation, (b) harmonises the annotation of the three Universal Dependencies treebanks for Latin, (c) improves the lexical coverage of the Latin WordNet and the valency lexicon Latin-Vallex, and (d) expands the textual coverage of the Index Thomisticus Treebank. Furthermore, LiLa builds a set of newly-trained models for PoS-tagging and lemmatisation, and works on developing and testing the best-performing NLP pipeline for such a task.

Connections between the aforementioned types of data are edges labelled with a restricted set of values (metadata) taken from a vocabulary of knowledge description. The Knowledge Base thus consists of a set of connections between target and source nodes.
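
A knowledge base of labelled edges can be pictured with a minimal sketch such as the one below. The node identifiers and edge labels are invented for illustration and are not taken from the LiLa vocabulary; the sketch uses plain Python tuples rather than an RDF library.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """A labelled connection from a source node to a target node."""
    source: str
    label: str
    target: str


# Hypothetical nodes and labels: two corpus tokens linked to one lemma,
# and that lemma linked to a WordNet synset.
EDGES = [
    Edge("token:1", "hasLemma", "lemma:amo"),
    Edge("lemma:amo", "hasSynset", "synset:love"),
    Edge("token:2", "hasLemma", "lemma:amo"),
]


def targets(edges, source, label):
    """Follow edges with the given label outwards from a source node."""
    return [e.target for e in edges if e.source == source and e.label == label]


def sources(edges, target, label):
    """Follow edges with the given label backwards from a target node."""
    return [e.source for e in edges if e.target == target and e.label == label]


lemma_of_first_token = targets(EDGES, "token:1", "hasLemma")
tokens_of_lemma = sources(EDGES, "lemma:amo", "hasLemma")
```

Because resources of different kinds share nodes, here a lemma, a query can move from a token in one corpus to a synset in a lexical resource by chaining two edge lookups.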

References

Franzini Greta, Litta Eleonora, Ruffolo Paolo, Passarotti Marco, Testori Marinella. Latin WordNet Revision. DOI: 10.5281/zenodo.4030823

Franzini Greta, Peverelli Andrea, Ruffolo Paolo, Passarotti Marco, Sanna Helena, Signoroni Edoardo, Ventura Viviana, Zampedri Federica. 2019. Nunc Est Aestimandum: Towards an Evaluation of the Latin WordNet. In Bernardi Raffaella, Navigli Roberto, Semeraro Giovanni (eds.) Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), 13-15 November, Bari. Accademia University Press, Torino. ISBN: 979-12-80136-00-8. DOI: 10.5281/zenodo.3518774

Mambrini Francesco, Passarotti Marco, Eleonora Litta, Giovanni Moretti. 2021. Interlinking Valency Frames and WordNet Synsets in the LiLa Knowledge Base of Linguistic Resources for Latin, in Alam Mehwish, Groth Paul, de Boer Victor, Pellegrini Tassilo, Pandit Harshvardhan J., Montiel Elena, Rodríguez Doncel Víctor, McGillivray Barbara, Meroño-Peñuela Albert (eds.), Further with Knowledge Graphs. Proceedings of the 17th International Conference on Semantic Systems, 6-9 September 2021, Amsterdam, The Netherlands, Series: Studies on the Semantic Web – Volume 53, IOS Press, Amsterdam, The Netherlands, 2021, pp. 16-28. ISBN: 978-1-64368-200-6 (print) | e-ISBN: 978-1-64368-201-3 (online). DOI: https://ebooks.iospress.nl/doi/10.3233/SSW210032. Zenodo: https://zenodo.org/record/5482432#.YTczUC0QND1
