Accepted Papers

Sorted alphabetically by title

1. A Brief Survey of Deep Learning based methods, against OpenNLP NameFinder for Named Entity Recognition on Portuguese Literary Texts

Authors: Vinicius Amaro Sampaio, Mardônio J. C. França, Paulo Bruno Lopes da Silva, Gustavo Augusto Lima de Campos, Lara Domingos Hissa
Abstract: This paper is a brief comparative study of Named Entity Recognition (NER). It presents an evaluation of Apache's OpenNLP against CRF and Deep Learning methods and frameworks such as TensorFlow. The evaluation is based on the F1-score. For this work, we tagged samples of literary texts by Machado de Assis and Jayson Aguiar and trained our own models on them. The results show the potential of Deep Learning methods; we plan to investigate more recent approaches and frameworks, such as PyTorch and LSTM+CRF, and to expand the size of our corpus.

2. Análise das relações entre disciplinas do Ensino Médio do Brasil por meio de questões de vestibular com uso de técnicas de PLN

Authors: Rafael Telles, Margarethe Steinberger-Elias, André Kazuo Takahata, Luneque Silva Junior
Title: Analysis of relationships among high school subjects in Brazil through college entrance exams using NLP techniques
Abstract: This paper presents a study of how different high school subjects in Brazil can be identified through the classification of college entrance exam questions. The corpus, built from a set of questions, was processed with Natural Language Processing tools so that it could be used as training and test data for a Machine Learning algorithm. The results indicate that the cases in which the classifier obtained the lowest accuracy correspond exactly to the most similar subjects, such as Portuguese and Literature or Physics and Chemistry. This can be explained by the fact that similar subjects have a significant number of words in common.

3. Chatbot para auxiliar os discentes nos procedimentos administrativos de uma universidade

Authors: Wesley Benício dos Santos Silva, Márcio de Souza Dias, Nádia F. F. da Silva
Title: Chatbot to help students in the administrative procedures of a university
Abstract: Most new students are unaware of how the university functions, which can cause inconvenience when they request a service offered by the institution. The purpose of this project is therefore to develop a chatbot tool that helps students with questions about university procedures. The tool will seek to respond quickly and in a friendly manner to students' questions about the various procedures of the university, indicating the steps needed to resolve them.

4. Classificação de subjetividade para a língua portuguesa

Authors: Luana Balador Belisário, Luiz Gabriel Ferreira, Thiago Alexandre Salgueiro Pardo
Title: Subjectivity classification for Portuguese
Abstract: In this paper, we report our investigation of the subjectivity classification task for Brazilian Portuguese. We reproduce the only known work in the area for this language and extend it to corpora from other domains. We report good results for machine learning and lexicon-based methods and show that these methods are significantly influenced by several factors, such as the size of the corpus, the balancing of classes, and the preprocessing steps.

5. Compilação de um Banco Multilíngue de Acolhimento a Pessoas Refugiadas

Authors: Anna B. D. Furtado, Elisa D. Teixeira
Title: Compilation of a Multilingual Databank for Sheltering Refugees
Abstract: The migration crisis can be seen on every continent, and Brazil is no exception. In 2016, the country received 10,308 asylum claims. In this context, the MOBILANG (Mobilities and Languages in Contact) research group proposes to create a multilingual terminological databank from frequent word clusters obtained from the texts and materials that refugees receive when they arrive in Brazil. In this work, we attempt to systematize the steps taken to compile a multilingual corpus on migration and asylum and to extract clusters, using Corpus Linguistics as our theoretical and methodological approach. We extract term candidates with the Sketch Engine software. As results, we present data on the corpus and a small glossary built from the term candidates and their respective translations. Finally, we discuss the reliability of the results and future steps.

6. Do PDF ao TXT: Desafios na extração de informação em textos técnico-científicos

Authors: Aline Silveira, Elvis de Souza, Tatiana Cavalcanti, Cláudia Freitas
Title: From PDF to TXT: Challenges in information extraction from technical-scientific texts
Abstract: Information Extraction (IE) is a process that transforms unstructured text data into information organized according to a specific interest. Our research focuses on building an extensive corpus in Portuguese, composed of scientific texts from the oil and gas domain. We aim to facilitate semantic search in the area; for this, IE becomes fundamental. Nonetheless, for a natural language text to be machine-readable, it must go through an initial treatment. This paper highlights the relevance and the challenges of data processing. When a document contains elements such as images, tables or footnotes, the conversion from PDF to plain text (.txt) can show varying levels of deformation. The result of the preprocessing is a well-defined, organized text, ready to be analyzed and arranged at a more complex level. To seek guidance from previous work, we also investigated tasks from SemEval’s 2017 and 2018 editions.

7. ET: uma Estação de Trabalho para revisão, edição e avaliação de corpora anotados morfossintaticamente

Authors: Elvis de Souza, Cláudia Freitas
Title: A workstation for revising, editing and evaluating morphosyntactically annotated corpora
Abstract: Morphosyntactic annotation systems that use machine learning technology require large and well-annotated corpora to learn how to annotate texts properly. In general, much has been done to improve the quality of automatic annotators with respect to the technology behind the systems. However, there are still bottlenecks in the quality of the training material, which ends up negatively affecting the quality of learning. We believe that one way to overcome these bottlenecks is linguistic: improving the annotation of training corpora, making them more consistent and eliminating possible human errors. In this context, we present a workstation designed from a linguistic perspective with the objective of facilitating the revision, editing and evaluation of annotated corpora, aligning the work done by language experts, on the one hand, with the practical results, that is, the performance of NLP systems, on the other. In this way, theoretical discussions about grammatical categories may be based not only on how well the language fits certain theories, but also on the empirical results of machine learning systems.

8. Identificação Automática de Erros em Sumários Multidocumento

Authors: Henrique Papa A. Fonseca, Márcio de Souza Dias, Nádia Félix Felipe da Silva
Title: Automatic Error Identification in Multi-Document Summaries
Abstract: Multidocument Summarization is an important area of Natural Language Processing (NLP) that generates a summary from several texts dealing with the same subject. However, despite being informative, the texts generated by summarizers present linguistic errors that affect their cohesion and coherence. With that in mind, this work proposes a tool based on heuristic methods capable of indexing internal linguistic errors in the texts produced by automatic summarizers. This work is still in progress; however, it can already identify the ‘Acronym without explanation’ error with an accuracy of 98.7%.

9. Investigação do uso de word embeddings para cálculo de similaridade em memórias de tradução

Authors: Karina Mayumi Johansson, Helena de Medeiros Caseli
Title: Research of the use of word embeddings for calculation of similarity in translation memories
Abstract: The strategy traditionally employed by CAT tools to match the segments of the sentence currently being translated with the segments stored in the translation memory considers the intersection of the word sequences (n-grams) present in the segments being compared. However, this strategy cannot capture semantic similarities beyond the trivial level. This study therefore presents a project that aims to investigate the applicability of monolingual and bilingual word embeddings to the matching. The study is still in its initial phase of development. Next, a strategy for calculating similarity using word embeddings will be proposed and implemented, and it will be incorporated into an open-source CAT tool. To evaluate the proposed strategies, the quality of matching in the baseline system (a version of a CAT system without any modification) will be compared with that of the system in which the proposed method is implemented. By the end of this project, we expect to have obtained a strategy based on semantic similarity that will be an alternative to the traditional matching strategy based on n-grams. Although there are already works covering the use of word embeddings for detecting textual similarity and for cleaning translation memories, we found no literature on any work that has pursued the objective of this project. Consequently, this study should be considered a first initiative toward an investigation in this context.

10. Melhorias linguísticas no alinhador texto-imagem LinkPICS

Authors: Joao Gabriel Melo Barbirato, Helena de Medeiros Caseli
Title: Linguistic improvements on the text-image aligner LinkPICS
Abstract: Text-image alignment is the task of relating elements of a text to the elements of an accompanying image. Through this alignment it is possible, for example, to improve text retrieval results and provide accessibility on news sites. In this paper, we implement some linguistic-based improvements on LinkPICS, a tool for text-image alignment. Although LinkPICS has shown good results in the alignment of people and objects (VELTRONI; CASELI, 2018), it has some limitations, for which solutions are proposed and implemented in this work. The new version of LinkPICS overcomes these limitations by identifying synonyms using WordNet, tagging multi-word expressions and aligning one object to many image regions. With these improvements, the new version of LinkPICS aligns expressions containing more than one word and also recognizes synonyms. The evaluation shows 100% precision in synonym identification and 42.02% precision in multi-word expression tagging.

11. Preparação para Leitura Distante em português: diálogos entre PLN e Humanidades Digitais

Authors: Luísa Rocha, Cláudia Freitas, Diana Santos
Title: Preparing for Distant Reading in Portuguese: the dialogue between NLP and Digital Humanities
Abstract: Distant Reading is a technique developed by Franco Moretti that aims to analyze patterns in texts of the literary genre. To apply this technique to literary works in Portuguese, we found it important to survey a set of works on the matter and thus become more familiar with the area. As a result, we noticed that a few adjustments had to be made, and precautions taken, when handling the OBras corpus. The adjustments made and addressed in this paper are the correction of the grammatical gender of proper nouns, the addition of a semantic role, and the correction of the segmentation, so that the OBras corpus can be explored in the best way possible.

12. Reconhecimento de posicionamentos de natureza moral em textos

Authors: Wesley Ramos dos Santos, Ivandré Paraboni
Title: Moral Stance and Polarity Recognition from Text
Abstract: In this project we introduce a labelled corpus of stances towards moral issues for the Brazilian Portuguese language, and present reference results for both the stance recognition and polarity classification tasks. The corpus was built from Twitter by searching for keywords denoting a number of target topics, and the reference results are expected to be taken as a baseline for further studies in the field of stance recognition and polarity classification from text.

13. Um estudo sobre desidentificação de evoluções clínicas

Authors: Thaila Elisa Quaini, Henrique D. P. dos Santos, Sandra C. de Abreu, Bernardo S. Consoli, Renata Vieira
Title: A study on de-identification of clinical notes
Abstract: Patients' medical records are important in the field of medical research. However, information that identifies a patient, as covered by the Health Insurance Portability and Accountability Act (HIPAA), must be removed before the records are used in research. Manual de-identification of large amounts of medical record data is expensive, time-consuming and error-prone, which calls for large-scale automated de-identification methods. This paper presents an analysis of the problem for the Brazilian Portuguese language, on the task of de-identification of electronic medical records. We compare the main types of business rule identification with an approach based on a list of names built specially for the task. The list of names was developed from the database, using word embeddings to specialize the names through the semantic similarity between words.

Event registration information

Undergraduate students who are authors of papers accepted at TILic 2019 must register at www.bracis2019.ufba.br/#team, selecting the registration type BRACIS + STIL + ENIAC undergraduate.

Only papers that are actually presented at TILic 2019 will be published.

Poster Template

Papers accepted at TILic 2019 will be presented in poster format. For preparing the posters, we provide the following instructions:

Suggested poster size: 90 cm wide by 120 cm high
Identification: title of the paper, name(s) of the author(s), followed by the names of the institution and of the funding agency (if any)
Text sections (suggestion only): introduction, objective or proposal, theoretical background, methodology, theoretical or practical results (if any), and references
Font type and size: at the author's discretion; we suggest, however, a font size of at least 28

As a suggestion only, we provide a poster template in the following formats:

PowerPoint template (PPT and PPTX)