PaDAWan 2024: 1st Portuguese Data Augmentation Workshop (PaDAWan)
Belém, Pará, Brazil
co-located with STIL 2024
November 17th to 21st, 2024
The Portuguese Data Augmentation Workshop (PaDAWan) aims to gather the community working on Data Augmentation, particularly employing Large Language Models (LLMs), in Portuguese.
With the advancement of LLMs, many traditional Natural Language Processing (NLP) tasks are being revisited. One traditional key challenge is gathering high-quality data for training and evaluating specific tasks. This has often been the main bottleneck in developing machine learning models. Data augmentation has become a crucial technique for enhancing the performance of these models across various tasks, especially when reliable data are limited. Nowadays, particularly with the use of LLMs, it has become feasible to apply sophisticated text data augmentation techniques effectively.
The use of LLMs is still restricted by several factors, such as cost, privacy concerns, and latency. Given this scenario, using LLMs to generate synthetic data to train classical models for specific tasks is a viable approach. Moreover, while many works in industry rely on synthetic data, scientific discussions on methods and evaluations are not always aligned with market needs.
This workshop aims to delve into the use of LLMs for data augmentation, exploring possible methods, evaluation techniques, and associated ethical considerations. The goal is to bring together industry professionals and academics for an in-depth discussion of the topic.
We invite researchers to submit papers that discuss challenges and advances in Portuguese data generation, including but not limited to the following topics:
Data creation and data labeling
Data reformation and anonymization
Data contamination and noise
Co-annotation
Augmented data evaluation and controlled data augmentation
Ethics in generated data and unbiased data generation
Practical applications or case studies of data augmentation techniques
Challenges in Portuguese Synthetic/Augmented Data
Submissions should describe original, unpublished work. Authors are invited to submit two kinds of papers:
Full papers – Reporting substantial and completed work, especially those that may contribute in a significant way to the advancement of the area. Wherever appropriate, concrete evaluation results should be included. Full papers may consist of up to 8 pages of content, plus unlimited pages of references.
Short papers – Reporting small, focused contributions such as ongoing work, position papers, potential ideas to be discussed, or negative results. Short papers may consist of up to 4 pages of content, plus unlimited pages of references.
Both full and short papers will be published in the proceedings of the main conference in a special section. Authors must follow the STIL guidelines for publishing.
Lightning Talks
We also invite lightning talks: 10-minute talks that overview or review an already published paper in the area.
Lightning talks are intended to let authors who have already contributed to the field share what they have learned with the community. They will not be published, and their acceptance depends on time-slot availability, with unpublished work considered first.
To submit a lightning talk, please submit a summary of your previous work, up to two pages long, via the submission link and select "Lightning talk".
Submission system
Submissions should be made via the EasyChair system.
Important Dates
Schedule
Deadline for long and short paper submission: September 11, 2024 (previously September 10)
Notification to authors: October 08, 2024 (previously October 05)
Camera-ready versions due: October 13, 2024
Organization
Livy Real - CE-PLN/SBC
Evandro Fonseca - Blip/PUCRS
Paula Cardoso - Universidade Federal do Pará
Program Committee
Evelin Amorin (INESC TEC)
Helena Cameron (ESTGD)
Bernardo Gonçalves (C4AI)
Saullo Haniell (PUC CAMPINAS)
Eduardo Luz (UFOP)
Renan Mendes (BLIP)
Thiago Pardo (USP)
Jayr Pereira (UFPE)
Diana Santos (UIO)
Ivanovich Silva (UFRN)
Marlo Souza (UFBA)
Marcos Spalenza (UFES)
Luis Trigo (LIAAD-INESC)
Clarissa Xavier (UFRGS)
Valeria de Paiva (Topos Institute)
Daniela Schmidt (UE)
Invited Speaker:
Rodrigo Nogueira (Maritaca AI):
Title: Can Generative AI Generate Knowledge?
Abstract: We have witnessed a remarkable improvement in the capabilities of generative models. Less than a decade ago, we were amazed by their performance on tasks such as image captioning and machine translation. Today, many of us use them to write code, and some of us even let them take control of our computers to accomplish tasks. However, a question still remains: are they just tools to help us in our everyday lives, or something more, endowed with intelligence? In this talk, we will discuss a more specific aspect of this debate: are they truly capable of producing new knowledge, or are they just "interpolating" existing data? We will see evidence that they are already producing new knowledge in some fields. Then, we will investigate how this capability will allow the creation of higher quality training data and how the distributed nature of this task can drastically reduce the development cost of LLMs.
Bio: Rodrigo Nogueira is the founder and CEO of Maritaca AI, a company specializing in the development of specialized LLMs in Brazil. He was a pioneer in the use of Transformers in search systems and co-author of the book "Pretrained Transformers for Text Ranking." Rodrigo holds a Ph.D. in Computer Science from New York University (NYU), having been mentored by the renowned Professor Kyunghyun Cho. Throughout his career, Nogueira has made contributions to the fields of Information Retrieval and Natural Language Processing through the creation of models such as BERTimbau, doc2query, monoT5, and more recently, the Sabiá 1, 2 and 3 models, which are specialized LLMs in Brazil.
Accepted Papers
Augmenting Data to Improve the Performance of Recommender Systems - Leticia Freire de Figueiredo, Joel Pinho Lucas and Aline Paes
Automated Topic Annotation in Brazilian Product Reviews: A Case Study of Adversarial Examples with Sabia-3 - Lucas Nildaimon dos Santos Silva and Livy Real
Brazilian Consumer Protection Code: a methodology for a dataset to Question-Answer (QA) Models - Aline Athaydes, Lucas Bulcao, Caio Sacramento, Babacar Mane, Daniela Claro, Marlo Souza and Robespierre Pita
Getting Logic From LLMs: Annotating Natural Language Inference with Sabiá - Fabiana Avais, Marcos Carreira and Livy Real
LLM-SEMREL: Towards a Better Coreference Resolution for Portuguese - Evandro Fonseca and Joaquim Neto
Text extraction from Knowledge Graphs in the Oil and Gas Industry - Laura Milena Parra Navarro, Elvis A. de Souza and Marco Aurelio Pacheco
Program
The complete event schedule can be seen here. All presentations will be oral and will last for 15 minutes, followed by 5 minutes for discussion.
Picture Credits: Bruna Brandão - MTUR