ASSIN (Avaliação de Similaridade Semântica e INferência textual) is a dataset with semantic similarity score and entailment annotations. It was used in a shared task in the PROPOR 2016 conference.
The full dataset has 10,000 sentence pairs, half of which are in Brazilian Portuguese and half in European Portuguese. Each language variant has 2,500 pairs for training, 500 for validation and 2,000 for testing. This differs from the split used in the shared task, in which the training set had 3,000 pairs and there was no validation set. The shared task training set can be reconstructed by merging the training and validation sets of each variant.
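The merge described above can be sketched in a few lines of Python. This is only an illustration under assumptions: it supposes that each file is an XML document whose root directly contains `pair` elements, and the element names, attribute names and sample content below are illustrative stand-ins, not taken from the official distribution. `merge_corpora` is a hypothetical helper, not part of the ASSIN tools.

```python
import xml.etree.ElementTree as ET


def merge_corpora(roots):
    """Return a new root holding every <pair> child of the given roots, in order.

    Assumes each root directly contains <pair> elements (an assumption
    about the file layout, not a documented guarantee).
    """
    merged = ET.Element(roots[0].tag, roots[0].attrib)
    for root in roots:
        merged.extend(root.findall("pair"))
    return merged


# Tiny in-memory stand-ins for a training file and a validation file.
train_xml = """<entailment-corpus>
  <pair id="1" entailment="None" similarity="2.5"><t>A</t><h>B</h></pair>
  <pair id="2" entailment="Entailment" similarity="4.0"><t>C</t><h>D</h></pair>
</entailment-corpus>"""
dev_xml = """<entailment-corpus>
  <pair id="3" entailment="Paraphrase" similarity="5.0"><t>E</t><h>F</h></pair>
</entailment-corpus>"""

merged = merge_corpora([ET.fromstring(train_xml), ET.fromstring(dev_xml)])
merged_ids = [pair.get("id") for pair in merged.findall("pair")]
print(len(merged_ids), "pairs after merging")
```

With real files, the roots would come from `ET.parse(path).getroot()` and the result could be written back with `ET.ElementTree(merged).write(out_path, encoding="utf-8")`. For one language variant, the merged set should then contain 2,500 + 500 = 3,000 pairs.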
The official ASSIN evaluation script and baseline implementations are available in the GitHub repository. They are written in Python and require NumPy, SciPy and scikit-learn. One of the baselines also requires NLTK.
The evaluation script computes accuracy and macro F1 (the mean of the F1 scores of all classes) for textual entailment recognition, and Pearson correlation and mean squared error for semantic similarity. It can be run as follows:
python assin-eval.py gold-file.xml system-file.xml
To see its usage instructions, run:
python assin-eval.py -h
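For readers who want to see what the four reported metrics measure, here is a minimal, dependency-free sketch. It is not the official script: the function names are hypothetical, the toy labels and scores are made up for illustration, and macro F1 is averaged over the classes that appear in the gold labels, which is an assumption about how classes are enumerated.

```python
import math


def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes present in gold."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def mean_squared_error(x, y):
    """Average squared difference between gold and predicted scores."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Toy data, invented for this example only.
gold_labels = ["None", "Entailment", "Paraphrase", "None"]
pred_labels = ["None", "Entailment", "None", "None"]
gold_scores = [1.0, 2.0, 3.0, 4.0]
pred_scores = [1.5, 2.5, 3.5, 4.5]

print("accuracy:", accuracy(gold_labels, pred_labels))
print("macro F1:", macro_f1(gold_labels, pred_labels))
print("Pearson: ", pearson(gold_scores, pred_scores))
print("MSE:     ", mean_squared_error(gold_scores, pred_scores))
```

The toy similarity scores show why both similarity metrics are reported: predictions that are shifted by a constant still correlate perfectly with the gold scores, while the mean squared error exposes the offset.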
FONSECA, E. R.; BORGES DOS SANTOS, L.; CRISCUOLO, M.; ALUÍSIO, S. M. Visão Geral da Avaliação de Similaridade Semântica e Inferência Textual. Linguamática, v. 8, n. 2, p. 3-13, 31 Dez. 2016.
The second edition of ASSIN took place in parallel with STIL 2019. Here you can find the dataset.
This is a list of published results on ASSIN that we are aware of, besides those of the shared task participants. The Task column indicates whether the paper addresses Textual Entailment (TE), Semantic Similarity (SS) or both.
Title | Task | Year
Avaliando a similaridade semântica entre frases curtas através de uma abordagem híbrida | SS | 2017
Análise de Medidas de Similaridade Semântica na Tarefa de Reconhecimento de Implicação Textual | TE | 2017
Statistical and Semantic Features to Measure Sentence Similarity in Portuguese | SS | 2017
Gradually Improving the Computation of Semantic Textual Similarity in Portuguese | SS | 2017
Recognizing Textual Entailment and Paraphrases in Portuguese | TE | 2017
Recognizing textual entailment: Challenges in the Portuguese language | TE | 2018
ASAPP 2.0: Advancing the state-of-the-art of semantic textual similarity for Portuguese | SS | 2018
Syntactic Knowledge for Natural Language Inference in Portuguese | TE | 2018
Enhancing Brazilian Portuguese Textual Entailment Recognition with a Hybrid Approach | SS | 2018
The following people annotated at least 200 sentence pairs each in the ASSIN corpus; their affiliations are given in parentheses.
Ana Paula dos Reis Lima (ICMC/USP)
Antonio Pedro Lavezzo Mazzarolo (ICMC/USP)
Carlos Alberto Schneider (ICMC/USP)
Clayton de Oliveira (ICMC/USP)
Darlan Santana Farias (ICMC/USP)
Edilson Anselmo Corrêa Júnior (ICMC/USP)
Erick Maziero (ICMC/USP)
Erick Rocha Fonseca (ICMC/USP)
Fabio Eduardo Araujo Cardoso (ICMC/USP)
Fernando Antônio Asevedo Nobrega (ICMC/USP)
Francielle Vargas (ICMC/USP)
Henrico Brum (ICMC/USP)
Ivone Penque Matsuno (ICMC/USP)
Leandro Borges dos Santos (ICMC/USP)
Lilian Berton (ICMC/USP)
Livy Real (IBM Research Brasil)
Magali Duran Sanches (ICMC/USP)
Marcelo Criscuolo (ICMC/USP)
Marcos Treviso (ICMC/USP)
Maria das Graças Volpe Nunes (ICMC/USP)
Mirella de Souza Balestero (DL/UFSCAR)
Rafael Hiroki Minami (ICMC/USP)
Roberta Sinoara (ICMC/USP)
Sandra Maria Aluisio (ICMC/USP)
Tamires Brito da Silva (ICMC/USP)
Thiago Pardo (ICMC/USP)
Vanessa Queiroz Marinho (ICMC/USP)