ArzEn Resources
The ArzEn project contains multiple corpora that can be used for a wide range of NLP tasks, including speech recognition, machine translation, speech translation, morphological segmentation, and assessing ASR evaluation metrics. The corpora were collected in collaboration with multiple institutes, including QCRI.
ArzEn Speech Corpus
Applications: speech recognition
ArzEn Speech Corpus is a 12-hour spontaneous conversational speech corpus containing a considerable amount of code-switching. The recordings were collected through 38 informal interviews with Egyptian Arabic-English bilinguals. The interviews were held at the German University in Cairo. The interviewers and interviewees discussed general topics such as career, studies, and hobbies, as well as work and travel experiences. The recordings were segmented and transcribed by human transcribers. The corpus is divided into train, dev, and test sets; the split was chosen so that the dev and test sets are balanced in terms of gender distribution, number of interviews, duration, speaking rate (words per minute), and code-switching (CS) metrics. More information about the corpus can be found in the following publications:
Hamed, Injy, Ngoc Thang Vu, and Slim Abdennadher. "ArzEn: A speech corpus for code-switched Egyptian Arabic-English." In Proceedings of the 12th Language Resources and Evaluation Conference (2020).
Hamed, Injy, Pavel Denisov, Chia-Yu Li, Mohamed Elmahdy, Slim Abdennadher, and Ngoc Thang Vu. "Investigations on speech recognition systems for low-resource dialectal Arabic–English code-switching speech." Computer Speech & Language 72 (2022).
Corpus Example:
Audio recording
Transcription: I acquired a lot of knowledge يعني عن ال+networks بشكل عام من ال+project ده
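As a rough illustration of how the amount of code-switching in a transcription might be quantified, the sketch below tags each whitespace-separated token by script and counts the tags. It is a minimal sketch under stated assumptions, not the corpus authors' method: the function names are hypothetical, the `+` in tokens such as ال+networks is assumed (from the example above) to join an Arabic clitic to an English word, and the CS metrics actually used for the corpus split are described in the publications, not here.

```python
from collections import Counter

def script_of(token):
    """Classify a token by the scripts of its letters (assumed heuristic)."""
    has_ar = any("\u0600" <= ch <= "\u06FF" for ch in token)   # Arabic block
    has_en = any(ch.isascii() and ch.isalpha() for ch in token)  # Latin letters
    if has_ar and has_en:
        return "mixed"  # e.g. an Arabic clitic joined to an English word
    if has_ar:
        return "ar"
    if has_en:
        return "en"
    return "other"

def tag_tokens(utterance):
    """Return (token, tag) pairs for each whitespace-separated token."""
    return [(tok, script_of(tok)) for tok in utterance.split()]

def language_counts(utterance):
    """Count how many tokens fall under each tag."""
    return Counter(tag for _, tag in tag_tokens(utterance))

# The transcription from the corpus example above.
example = "I acquired a lot of knowledge يعني عن ال+networks بشكل عام من ال+project ده"

if __name__ == "__main__":
    print(language_counts(example))
```

Script-based tagging is only a proxy for language identification: it cannot detect, for instance, English words written in Arabic script.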
The corpus can be downloaded from here.
ArzEn-ST Corpus
Applications: machine translation, speech translation
In the ArzEn-ST corpus, we provide Egyptian Arabic as well as English translations for the transcriptions in the ArzEn Speech Corpus. The translation guidelines, as well as benchmark baseline results for the speech recognition, machine translation, and speech translation tasks, can be found in the following publication:
Hamed, Injy, Nizar Habash, Slim Abdennadher, and Ngoc Thang Vu. "ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic - English." In Proceedings of the 7th Arabic Natural Language Processing Workshop (2022).
Corpus Example:
Transcription: I acquired a lot of knowledge يعني عن ال+networks بشكل عام من ال+project ده
EGY translation: اكتسبت معرفة كبيرة يعني عن الشبكات بشكل عام من المشروع ده
EN translation: Actually, I acquired a lot of knowledge about the networks in general based on this project.
This corpus was funded by DAAD (German Academic Exchange Service).
The corpus can be downloaded from here.
ArzEnSEG
Applications: morphological segmentation
In the ArzEn Surface Segmentation (ArzEnSEG) Corpus, we provide surface-level morphological segmentation annotations for the first 500 lines of the ArzEn dev set. The segmentation of Arabic and English words was performed by two bilingual speakers who collaborated on the initial annotations and quality checks. For Arabic segmentation, we follow the Arabic Treebank (ATB) segmentation scheme. For more details about the segmentation guidelines, please refer to the following publication:
Gaser, Marwa, Manuel Mager, Injy Hamed, Nizar Habash, Slim Abdennadher, and Ngoc Thang Vu. "Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text." In Proceedings of EACL (2023).
Corpus Example:
Input: it depends بصراحة بالنسبالي ع ال situation
Segmentation: it depend#s ب#صراحة ب#النسبا#ل#ي ع ال situation
HAC: Human Acceptability Corpus for Code-switching
Applications: assessing ASR evaluation metrics
The HAC corpus consists of ASR hypotheses for a subset of the ArzEn Speech Corpus, along with post-editing annotations performed by bilingual speakers to correct the hypotheses. The corpus contains 1,301 utterances from the ArzEn Speech Corpus; for each utterance we obtain hypotheses from three different ASR systems, resulting in 3,903 hypotheses. The 1,301 utterances were taken from 7 interviews in the ArzEn train set, covering 2 hours of speech. The 3,903 hypotheses were annotated for minimal post-editing: the annotators were asked to make the fewest edits needed to render each hypothesis acceptable. The guidelines provided for the annotators can be found here. The minimal-edit annotations provide a reference human acceptability corpus that can then be used to obtain ground truth error rates for the ASR hypotheses. For more information about the HAC corpus, please refer to the following publication:
Hamed, Injy, Amir Hussein, Oumnia Chellah, Shammur Chowdhury, Hamdy Mubarak, Sunayana Sitaram, Nizar Habash, and Ahmed Ali. "Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition." In Proceedings of SLT (2022).
This work was carried out during the 2022 Jelinek Memorial Summer Workshop on Speech and Language Technologies at Johns Hopkins University, which was supported with funding from Amazon, Microsoft, and Google.
Corpus Example:
Reference: I have to say اخر سفرية لي اللي هي في المانيا علشان اشتغلت فيها كويس اوي و كانت يعني كانت very fruitful.
Hypothesis: فلvery fruit أي هفتو ساي آخر سفرية ليا اللي هي في ألمانيا علشان اشتغلت فيها كويس قوي وكانت يعني كانت
Minimal Edit: very fruitful أي هاف تو ساي آخر سفرية ليا اللي هي في ألمانيا علشان اشتغلت فيها كويس قوي وكانت يعني كانت
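One way to turn a hypothesis and its minimal edit into an error rate is to compute a word-level edit distance between the two and normalize by the length of the minimal edit. The sketch below does this with a standard Levenshtein distance. It is an illustration only: the function names are hypothetical, the toy strings are not from the corpus, and the exact procedure and metrics used in the HAC publication may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete a reference token
                cur[j - 1] + 1,          # insert a hypothesis token
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Word-level edit distance normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)

if __name__ == "__main__":
    # Toy example: the minimal edit plays the role of the reference.
    minimal_edit = "it was very fruitful"
    hypothesis = "it was very fruit"
    print(word_error_rate(minimal_edit, hypothesis))  # one substitution in four words
```

Using the minimal edit rather than the original transcription as the reference means that acceptable variation in the hypothesis, such as alternative spellings, is not counted as an error.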