NLPre: a revised approach towards language-centric benchmarking
of Natural Language Preprocessing systems

Martyna Wiącek, Piotr Rybak, Łukasz Pszenny, Alina Wróblewska
Institute of Computer Science, Polish Academy of Sciences
{m.wiacek, p.rybak, l.pszenny, alina}@ipipan.waw.pl


Accepted at LREC-COLING 2024


Abstract

With the advancements of transformer-based architectures, we observe the rise of natural language preprocessing (NLPre) tools capable of solving preliminary NLP tasks (e.g. tokenisation, part-of-speech tagging, dependency parsing, or morphological analysis) without any external linguistic guidance. It is arduous to compare novel solutions to well-entrenched preprocessing toolkits, relying on rule-based morphological analysers or dictionaries. Aware of the shortcomings of existing NLPre evaluation approaches, we investigate a novel method of reliable and fair evaluation and performance reporting. Inspired by the GLUE benchmark, the proposed language-centric benchmarking system enables comprehensive ongoing evaluation of multiple NLPre tools, while credibly tracking their performance. The prototype application is configured for Polish and integrated with the thoroughly assembled NLPre-PL benchmark. Based on this benchmark, we conduct an extensive evaluation of a variety of Polish NLPre systems. To facilitate the construction of benchmarking environments for other languages, e.g. NLPre-GA for Irish or NLPre-ZH for Chinese, we ensure full customization of the publicly released source code of the benchmarking system.

Contributions

Customizable NLPre benchmark source code

As part of our work, we provide customizable code that facilitates the creation of similar NLPre benchmarks for other languages. The code can easily be adapted via a YAML configuration file and a WYSIWYG editor, as sketched below.
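To give a feel for this, here is a minimal, hypothetical sketch of what such a language-specific configuration might contain; the field names are illustrative assumptions, not the benchmark's actual schema.

# A hypothetical benchmark configuration; all field names are assumptions,
# not the actual NLPre config schema.
import yaml  # requires the PyYAML package

EXAMPLE_CONFIG = """
language: pl              # target language of the benchmark instance
tagsets:                  # tagsets the leaderboard reports on
  - ud
  - morfeusz
tasks:
  - tokenisation
  - pos_tagging
  - lemmatisation
  - dependency_parsing
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
print(config["language"], config["tasks"])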

NLPre resources for the Polish language

As part of our work, we configure a prototype of our system for the Polish language. To build this system and demonstrate that it supports multiple tagsets, we provide a new dataset, NLPre-PL, available in the Morfeusz tagset.

The dataset is based on the NKJP1M corpus and is split fairly into training, development, and test sets by paragraphs, according to either the name or the type of the original document.

Moreover, for the UD tagset, our system provides one additional test set drawn from the narrative-domain subdivision of the manually annotated Polish Dependency Bank 3.0, automatically converted into the UD schema.

All models and datasets provided in the deployed NLPre-PL benchmark are available on Hugging Face.
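For illustration, the snippet below shows how such resources are typically fetched with the Hugging Face datasets library; the repository identifier and configuration name are placeholders, not verified names, so please consult the NLPre-PL pages on the Hub for the actual ones.

# Placeholder repository and configuration names; check the NLPre-PL
# pages on the Hugging Face Hub for the actual identifiers.
from datasets import load_dataset

dataset = load_dataset("ipipan/nlpre-pl", "nkjp-by-name")
print(dataset["train"][0])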

Evaluation of Polish NLPre systems

In the paper, we provide an extensive evaluation of Polish NLPre systems, followed by a discussion (see the sketch below) of possible explanations for the obtained results.
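As a rough illustration of the metrics such an evaluation reports for dependency parsing (UAS and LAS), here is a minimal sketch assuming gold and predicted (head, deprel) pairs that are already token-aligned; evaluations of this kind commonly rely on the standard CoNLL 2018 shared-task script rather than ad-hoc code like this.

def uas_las(gold, pred):
    """Compute unlabelled/labelled attachment scores from aligned
    (head, deprel) pairs."""
    assert len(gold) == len(pred) and gold
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

# Toy example: three tokens, one mislabelled relation.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
print(uas_las(gold, pred))  # -> (1.0, 0.666...)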


Platforms for Irish and Chinese

To support our claim that the system is language-agnostic, we set up NLPre-GA for Irish and NLPre-ZH for Chinese. The choice of these languages is not arbitrary: our objective is to demonstrate the platform's ability to evaluate diverse languages, including those written in non-Latin scripts.

How to cite

BibTeX

@misc{wiącek2024nlpre,
      title={NLPre: a revised approach towards language-centric benchmarking of Natural Language Preprocessing systems},
      author={Martyna Wiącek and Piotr Rybak and Łukasz Pszenny and Alina Wróblewska},
      year={2024},
      eprint={2403.04507},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


Acknowledgments