Dataset / Tools

TSAC: Tunisian Sentiment Analysis Corpus

About 17k user comments manually annotated with positive or negative polarity. The corpus was collected from Facebook user comments posted on the official pages of Tunisian radio stations and TV channels, namely Mosaique FM, JawhraFM, Shemes FM, HiwarElttounsi TV and Nessma TV. The comments were collected over a period spanning January 2015 to June 2016.

If you use the TSAC corpus, please cite the following paper:

Paper:

Salima Mdhaffar, Fethi Bougares, Yannick Estève and Lamia Hadrich-Belguith. Sentiment analysis of Tunisian dialects: Linguistic Ressources and Experiments. WANLP 2017, co-located with EACL 2017.

Link:

https://github.com/fbougares/TSAC

PASTEL: Performing Automated Speech Transcription for Enhancing Learning

The ANR PASTEL (Performing Automated Speech Transcription for Enhancing Learning) research project [1] (2017-2021) focused on the capabilities of speech transcription technology in a human learning environment.

The data in this repository were collected from the CominOpenCourseware (COCo) project [2], which provides several videos along with their available resources (video, slides, time alignment of the video with the slide changes), and from the canal-U platform [3], an online digital video library for higher education. All the videos were manually transcribed by an expert human annotator using the Transcriber tool [4]. The conventions used for the evaluation campaign transcripts served as a guide for transcribing the recorded lectures.

Paper:

Salima Mdhaffar, Yannick Estève, Antoine Laurent, Nicolas Hernandez, Richard Dufour, Delphine Charlet, Geraldine Damnati, Solen Quiniou, Nathalie Camelin. A Multimodal Educational Corpus of Oral Courses: Annotation, Analysis and Case Study. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC), 2020

Link:

https://github.com/nicolashernandez/anr-pastel-data

TARIC-SLU: A Tunisian Benchmark Dataset For Spoken Language Understanding

About 9 hours of speech sourced from the TARIC dataset. The TARIC dataset was recorded in train stations in Tunisia. The dataset consists of human-human recordings with their manual transcriptions and semantic annotations. It comprises more than 2,000 dialogues from 109 different speakers and is split into three parts (train, dev and test).

Paper:

Salima Mdhaffar, Fethi Bougares, Renato De Mori, Salah Zaiem, Mirco Ravanelli, Yannick Estève. TARIC-SLU: A Tunisian Benchmark Dataset For Spoken Language Understanding. LREC, 2024.

Link:

https://demo-lia.univ-avignon.fr/taric-dataset/
