Corpora / Corpus

If you use these corpora, PLEASE use the indicated reference in your publications.

Si vous utilisez ces corpus, MERCI de mettre la référence indiquée dans vos publications.

إذا استخدمت هذه البيانات ، فيرجى اضافة المرجع المشار إليه في منشوراتك

TD-COM corpus (2022)

TD-COM is a parallel corpus of Tunisian dialect and Modern Standard Arabic (MSA). It contains 3000 comments in Tunisian dialect extracted from social networks and manually translated into MSA by a native speaker. The corpus is available at https://github.com/sk-cmd/ressources-parallele-DT-ASM

Reference 1: Sameh Kchaou, Rahma Boujelbane and Lamia Hadrich Belguith (2022). Hybrid pipeline for building Arabic Tunisian Dialect-Standard Arabic Neural machine translation model from scratch. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). https://doi.org/10.1145/3568674

Reference 2: Sameh Kchaou, Rahma Boujelbane, Emna Fsih and Lamia Hadrich Belguith (2022). Standardisation of Dialect Comments in Social Networks in View of Sentiment Analysis: Case of Tunisian Dialect. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5436–5443, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.lrec-1.582

TA-Segmentation Corpus (2021)

Link: https://github.com/AsmaMekki/TA-Segmentation-Corpus

The TA-Segmentation Corpus is used for sentence segmentation of the three forms of Tunisian Dialect. It comprises 260,364 words and 33,581 sentences. It is manually normalized according to the CODA-TA orthographic convention, and its sentence segmentation has been validated by native experts.

Reference: Asma Mekki, Inès Zribi, Mariem Ellouze and Lamia Hadrich Belguith, "Sentence boundary detection of various forms of Tunisian Arabic", Language Resources and Evaluation, pages 1-29, 2021. DOI: 10.1007/S10579-021-09538-4

TTB (Tunisian Treebank) (2020)

Link: https://github.com/AsmaMekki/TA-Parser/tree/main

TTB (Tunisian Treebank): We syntactically annotated the Tunisian constitution, which contains 12,378 words, and 1,072 sentences of the STAC corpus. The corpus follows the CODA-TUN orthographic convention and is segmented and tokenized. The preprocessed corpus was first parsed with the Stanford parser for MSA; experts then corrected the parser's annotation errors and validated the result. The model (model_SMD.gz) was trained on 8,000 sentences and 79,604 tokens. Its best evaluation result is an F-measure of 80.12%. The model can be used by integrating it into the Stanford parser (link: https://nlp.stanford.edu/software/lex-parser.shtml).

Reference: Asma Mekki, Inès Zribi, Mariem Ellouze and Lamia Hadrich Belguith, "Treebank creation and parser generation for Tunisian Social Media text", The 17th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2020), Antalya, Turkey, 2020. DOI: 10.1109/AICCSA50499.2020.9316462

ArSentimentAnalysis Corpus (2019)

GitHub link: https://github.com/amirabaroumi/ArSentimentAnalysis

The ArSentimentAnalysis package provides a set of resources for designing and evaluating an Arabic opinion analysis system. It contains:

1/ Arabic-specific embedding sets

Existing pre-trained embeddings represent an Arabic word without taking into account the agglutination and morphological richness of Arabic. If a word, in the graphical sense, is defined as a sequence of characters delimited by two separators (a space or another separator such as punctuation), then an Arabic word can have a very complex structure: it can be decomposed into proclitic(s), an inflected form and enclitic(s). From this perspective, we assume that decomposing a complex word into simple elements could improve the quality of the embeddings. The embeddings have 300 dimensions. For more information on the different embedding spaces, please see the reference below.

2/ The ArSentLex polarity lexicon

It merges all the sentiment lexicons available to our knowledge: a set of 15 lexicons built with different methods. Each ArSentLex entry is a 5-tuple (w, pos, ps, ns, p), where w is a word, pos its part-of-speech tag, ps its positivity score, ns its negativity score and p its polarity (positive or negative). In other words, each w is described by four descriptors: pos, ps, ns and p. ArSentLex contains 51968 positive words and 45638 negative words.
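As an illustration only, the 5-tuple (w, pos, ps, ns, p) could be represented in Python as sketched below. The on-disk format of ArSentLex is not described here, so the tab-separated layout, the tag and polarity labels, and the names ArSentLexEntry, parse_entry and count_by_polarity are all assumptions made for this sketch, not part of the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArSentLexEntry:
    """One lexicon entry mirroring the 5-tuple (w, pos, ps, ns, p)."""
    w: str      # the word
    pos: str    # part-of-speech tag
    ps: float   # positivity score
    ns: float   # negativity score
    p: str      # polarity label, e.g. "positive" or "negative"


def parse_entry(line: str, sep: str = "\t") -> ArSentLexEntry:
    # ASSUMPTION: one entry per line, five fields separated by `sep`.
    # The real file layout may differ; adapt the split accordingly.
    w, pos, ps, ns, p = line.rstrip("\n").split(sep)
    return ArSentLexEntry(w, pos, float(ps), float(ns), p)


def count_by_polarity(entries):
    # Count how many entries carry each polarity label.
    counts = {}
    for entry in entries:
        counts[entry.p] = counts.get(entry.p, 0) + 1
    return counts


# Illustrative entry with made-up scores (not taken from ArSentLex).
example = parse_entry("جميل\tADJ\t0.8\t0.1\tpositive")
```

Applied to the full lexicon, a counter like count_by_polarity would be one way to check the positive and negative totals quoted above.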

Reference : Barhoumi A., Camelin N., Aloulou C., Estève Y., Hadrich Belguith L. (2019) An Empirical Evaluation of Arabic-Specific Embeddings for Sentiment Analysis. In: Smaïli K. (eds) Arabic Language Processing: From Theory to Practice. ICALP 2019. Communications in Computer and Information Science, vol 1108. Springer, Cham

Tunisian Sentiment Analysis Corpus (TSAC) (2017)

Licence: GNU Lesser General Public License v3.0

GitHub link: https://github.com/fbougares/TSAC

About 17k user comments manually annotated with positive or negative polarity. The corpus was collected from Facebook user comments written on the official pages of Tunisian radio stations and TV channels, namely Mosaique FM, JawhraFM, Shemes FM, HiwarElttounsi TV and Nessma TV. The collection period spans January 2015 to June 2016.

If you use the TSAC corpus, please cite the following paper:

Reference : Salima Medhaffar, Fethi Bougares, Yannick Estève and Lamia Hadrich-Belguith. Sentiment analysis of Tunisian dialects: Linguistic Ressources and Experiments. WANLP 2017. EACL 2017

Corpus TuDiCoI (Tunisian Dialect Corpus Interlocutor) (2015)

GitHub link: https://github.com/MarwaGraja/TuDiCOI.git

In the context of a research project on automatic understanding of spontaneous Arabic speech, ANLP-RG¹ (Arabic Natural Language Processing Research Group) is constructing an initial corpus on railway information, in cooperation with the Tunisian National Railway Company (SNCFT)². This corpus, called TuDiCoI (Tunisian Dialect Corpus Interlocutor), is a spoken dialogue corpus in Tunisian dialect. It consists of 1825 dialogues recorded in the Sfax railway station. The main task in the TuDiCoI corpus is requesting information about railway services in Tunisian dialect. The requests concern train schedules, train types, train destinations, train routes, fares and ticket booking. The corpus contains 5647 staff utterances and 6528 client utterances. To our knowledge, it is the only available spoken dialogue corpus in Tunisian dialect.

¹http://sites.google.com/site/anlprg

²http://www.sncft.com.tn/

Reference: M. Graja, M. Jaoua, L. Hadrich Belguith, Statistical Framework with Knowledge Base Integration for Robust Speech Understanding of the Tunisian Dialect, IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 23, NO. 12, DECEMBER 2015

Corpus ADQA - Arabic Definition Question Answering corpus (2010)

GitHub link: https://github.com/triguiomar/ADQA

* ArabicListDefQuest: 50 Arabic organization definition questions.

* ArabicCorpusWikipedia: 50 files, each containing snippets collected from Wikipedia for one question from ArabicListDefQuest.

* ArabicCorpusGoogle: 50 files, each containing snippets collected from Google for one question from ArabicListDefQuest.

* ArabicListDefAnsw from -Google+Wikipedia-: 50 Arabic files, each containing a list of definition answers extracted from both Google and Wikipedia snippets for one organization definition question.

* ArabicListDefAnsw from -Google-: 50 Arabic files, each containing a list of definition answers extracted from Google snippets for one organization definition question.

Reference: Omar Trigui, Lamia Hadrich Belguith and Paolo Rosso, "DefArabicQA: Arabic Definition Question Answering System", Workshop on Language Resources and Human Language Technologies for Semitic Languages, 7th LREC, May 17th 2010, Valletta, Malta.

Corpus: AnATAr corpus - An Arabic annotated corpus for anaphora resolution (2009)

This corpus is annotated using the AnATAr tool. It consists of a Tunisian book used for basic education. It contains 70 texts, 2892 sections, 18895 words and 2722 anaphor/antecedent pairs. The corpus is annotated with coreferential links, mainly the identity relations between the anaphors (pronouns, definite descriptions or proper names) and their antecedents (noun phrases).

Reference: Souha Mezghani Hammami, Lamia Hadrich Belguith, Abdelmajid Ben Hamadou. Arabic Anaphora Resolution: Corpora Annotation with Coreferential links. The International Arab Journal of Information Technology, vol. 6, no. 5, pp. 481-489, November 2009