Linked WordNets for Ancient Indo-European languages

Project description

This project aims to expand and link three existing WordNets of ancient Indo-European languages (Sanskrit, Ancient Greek, and Latin) with other linguistic resources.
WordNets are lexical databases that represent the lexicon in a relational way. Meanings are associated with lexical entries as synsets, brief glosses identified by an ID number. The lexical entries sharing the same synset form a synonymic set. Synsets are connected with one another through conceptual-semantic relations, while lexical entries are linked through lexical relations. The original WordNet was designed for English (Miller et al. 1990); since then, several WordNets have been developed for modern and ancient languages (on Latin and Ancient Greek, see Minozzi 2009, Bizzoni et al. 2014, Boschetti 2019).
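The relational structure described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class names, field names, synset IDs, and relation labels are invented for this example and do not reflect the project's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A meaning: a brief gloss identified by an ID (IDs here are invented)."""
    synset_id: str
    gloss: str
    # Conceptual-semantic relations: relation name -> IDs of related synsets.
    relations: dict = field(default_factory=dict)


@dataclass
class LexicalEntry:
    """A lemma of one language together with the synsets it is associated with."""
    lemma: str
    language: str
    synset_ids: set = field(default_factory=set)


def synonyms(entry, lexicon):
    """Lemmas of the same language that share at least one synset with `entry`."""
    return sorted(
        other.lemma
        for other in lexicon
        if other is not entry
        and other.language == entry.language
        and other.synset_ids & entry.synset_ids
    )


# Hypothetical toy data.
animal = Synset("s0002", "a living organism that moves voluntarily")
horse = Synset("s0001", "a solid-hoofed domesticated mammal",
               relations={"hypernym": {animal.synset_id}})

equus = LexicalEntry("equus", "lat", {horse.synset_id})
caballus = LexicalEntry("caballus", "lat", {horse.synset_id})
hippos = LexicalEntry("ἵππος", "grc", {horse.synset_id})
lexicon = [equus, caballus, hippos]

print(synonyms(equus, lexicon))
```

In this sketch the two Latin entries form a synonymic set because they point to the same synset, while the hypernym link between the two synsets stands for a conceptual-semantic relation.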
The first goal of this project is to harmonize and refine these three WordNets and make them interoperable. WordNets have the potential to allow for crosslinguistic semantic comparison by using the same set of synsets from the original WordNet. To ensure interoperability, our WordNets will share the same architecture, theoretical framework, annotation workflow, and guidelines. In addition, the creation of new synsets and other structural adjustments will be kept to a minimum.
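To make the idea of comparison through shared synsets concrete, the following sketch groups lemmas from different languages under a common synset ID. The function name, the tuple layout, the language codes, and the synset ID are assumptions made for this example only.

```python
from collections import defaultdict


def align_by_synset(entries):
    """Group lemmas by synset ID and language.

    `entries` is an iterable of (language, lemma, synset_id) tuples;
    this layout is an assumption of the sketch.
    """
    table = defaultdict(lambda: defaultdict(list))
    for language, lemma, synset_id in entries:
        table[synset_id][language].append(lemma)
    # Convert to plain dicts with sorted lemma lists for stable output.
    return {
        synset_id: {lang: sorted(lemmas) for lang, lemmas in by_lang.items()}
        for synset_id, by_lang in table.items()
    }


# Hypothetical toy data: one invented synset ID shared by three languages.
entries = [
    ("lat", "equus", "s0001"),
    ("grc", "ἵππος", "s0001"),
    ("san", "aśva", "s0001"),
    ("lat", "caballus", "s0001"),
]

aligned = align_by_synset(entries)
print(aligned["s0001"])
```

Because all three WordNets would refer to the same synset inventory, a lookup by synset ID is enough to retrieve candidate equivalents across languages.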
However, we will introduce some crucial innovations. Our lexicographic work will be framed within a principled view of polysemy drawn from cognitive linguistics (Winters et al. 2010). Furthermore, we will enrich our WordNets with philological and morphological information (periodization, literary genre, loci of attestation, principal parts and alternative/irregular forms of paradigms, etymology). These additions will account for the dynamic nature of each language's lexicon and make our WordNets appealing to a larger audience of scholars and students.
Our second goal is to enlarge the WordNets. To do this, we will combine automatic and manual annotation. We will import as much data as possible from available resources (e.g., etymological and domain-specific dictionaries, morphological analyzers and lemmatizers). We will test and evaluate the application of data-driven methods to ancient languages, such as word embeddings (Khodak et al. 2017), parallel corpora (Apidianaki/Sagot 2014), and automatic hypernym discovery from learnt syntactic patterns (Snow et al. 2005); the results obtained will be validated by human annotators. Human annotators will also perform parallel manual annotations on sets of agreed near-equivalents in the three languages.
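As a rough illustration of how an embedding-based step could feed human validation, the sketch below ranks candidate synsets for a lemma by cosine similarity between vectors. It is a simplified stand-in, not the method of Khodak et al. (2017); the vectors, synset IDs, and function names are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(lemma_vec, synset_vecs):
    """Return (synset_id, score) pairs sorted from most to least similar."""
    scored = [(sid, cosine(lemma_vec, vec)) for sid, vec in synset_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical two-dimensional toy vectors; real embeddings have many dimensions.
lemma_vec = [1.0, 0.0]
synset_vecs = {
    "s0001": [1.0, 0.0],
    "s0002": [0.0, 1.0],
    "s0003": [1.0, 1.0],
}

ranking = rank_candidates(lemma_vec, synset_vecs)
for synset_id, score in ranking:
    print(synset_id, round(score, 3))
```

A ranked list of this kind would only propose candidates; the decision to accept a lemma-synset pairing would remain with the human annotators.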
Our third goal is to link our WordNets with other textual and lexical resources by implementing the principles of Linguistic Linked Open Data (Cimiano et al. 2020). In particular, we will add sentence and semantic frame information to verbal entries, by linking them with morphosyntactically annotated corpora, valency lexica, and FrameNet.
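One way to picture the linked-data layer is as a set of RDF statements connecting a lexical entry to the synsets it expresses. The sketch below writes such statements as N-Triples lines using plain string formatting. It assumes the OntoLex-Lemon vocabulary (`ontolex:LexicalEntry`, `ontolex:evokes`), which the project description does not name, and the `example.org` base URI and path layout are placeholders.

```python
# Assumed vocabulary: the OntoLex-Lemon namespace. The base URI is a placeholder.
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/wordnet/"


def entry_triples(language, lemma_id, synset_ids):
    """Return N-Triples lines linking one lexical entry to its synsets."""
    subject = f"<{BASE}{language}/entry/{lemma_id}>"
    triples = [f"{subject} <{RDF_TYPE}> <{ONTOLEX}LexicalEntry> ."]
    for synset_id in sorted(synset_ids):
        triples.append(f"{subject} <{ONTOLEX}evokes> <{BASE}synset/{synset_id}> .")
    return triples


# Hypothetical entry and synset ID.
lines = entry_triples("lat", "equus", {"s0001"})
for line in lines:
    print(line)
```

Links to annotated corpora, valency lexica, or FrameNet could be expressed as further statements of the same shape, each pointing from the entry to a URI in the external resource.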
Finally, we aim to make our WordNets and interlinked resources accessible to everyone through a user-friendly open-source interface.

References

Apidianaki M., B. Sagot. 2014. Data-driven Synset Induction and Disambiguation for Wordnet Development. Language Resources and Evaluation 48: 655-677.
Bizzoni Y., F. Boschetti, H. Diakoff et al. 2014. The Making of Ancient Greek WordNet. https://www.aclweb.org/anthology/L14-1054/.
Boschetti F. 2019. Semantic Analysis and Thematic Annotation. In M. Berti (ed), Digital Classical Philology, 321-339. Berlin: De Gruyter.
Cimiano P., C. Chiarcos, J. McCrae, J. Gracia. 2020. Linguistic Linked Data: Representation, Generation and Applications. Berlin: Springer.
Khodak M., A. Risteski, C. Fellbaum, S. Arora. 2017. Automated WordNet Construction Using Word Embeddings. https://aclanthology.org/W17-1902/.
Miller G., R. Beckwith, C. Fellbaum et al. 1990. Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography 3(4): 235-244.
Minozzi S. 2009. The Latin WordNet Project. In P. Anreiter, M. Kienpointner (eds), Latin Linguistics Today. Akten des 15. Internationalen Kolloquiums zur Lateinischen Linguistik. Innsbrucker Beiträge zur Sprachwissenschaft 137: 707-716.
Snow R., D. Jurafsky, A. Ng. 2005. Learning Syntactic Patterns for Automatic Hypernym Discovery. Advances in Neural Information Processing Systems 17: 1297-1304.
Winters M., H. Tissari, K. Allan (eds). 2010. Historical Cognitive Linguistics. Berlin: De Gruyter.
