ArzEn: Corpus of Egyptian Arabic-English Code-switching
مكنز اللهجة المصرية الخليطة بالإنجليزية

Code-switching, i.e. the use of multiple languages within the same discourse, has become a worldwide phenomenon and is prevalent in Egypt. In the ArzEn project, we aim to collect corpora containing code-switched Egyptian Arabic-English text and speech. The corpus is intended for linguistic investigations and Natural Language Processing tasks. The project is a collaboration between New York University Abu Dhabi, Stuttgart University, and The German University in Cairo.

The ArzEn corpus is a spontaneous conversational speech corpus, obtained through informal interviews held at the German University in Cairo. The participants discussed broad topics, including education, hobbies, work, and life experiences. The corpus currently contains 12 hours of speech, comprising 6,216 utterances. The recordings were transcribed and translated into monolingual Egyptian Arabic and monolingual English. We discuss our transcription and translation guidelines and present results for benchmark systems for automatic speech recognition, machine translation, and speech translation tasks. The corpus is diverse in terms of the code-switching phenomena involved, covering the main types of code-switching: inter-sentential code-switching (at the sentence level), extra-sentential and intra-sentential code-switching (at the word level), and intra-word code-switching (at the morpheme level). The corpus is publicly available to support and motivate further research in the area of code-switching.
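
To make the word-level and morpheme-level distinction concrete, here is a minimal Python sketch that labels an utterance by the scripts its tokens use. It is not part of the ArzEn tooling. It assumes, purely for illustration, that Arabic words are written in Arabic script, English words in Latin script, and that intra-word switches appear as single tokens mixing both scripts. The function names and the example strings are hypothetical and are not taken from the corpus. Inter-sentential switching cannot be detected this way without sentence boundaries, so it is left out.

```python
import unicodedata


def char_script(ch):
    """Heuristically map one character to 'ar', 'en', or None."""
    if not ch.isalpha():
        return None  # digits, punctuation, diacritics, etc.
    name = unicodedata.name(ch, "")
    if name.startswith("ARABIC"):
        return "ar"
    if name.startswith("LATIN"):
        return "en"
    return None


def token_label(token):
    """Label a token as 'ar', 'en', 'mixed' (both scripts), or 'other'."""
    scripts = {s for s in map(char_script, token) if s is not None}
    if scripts == {"ar"}:
        return "ar"
    if scripts == {"en"}:
        return "en"
    if scripts == {"ar", "en"}:
        return "mixed"
    return "other"


def utterance_type(text):
    """Rough utterance-level label based only on token scripts."""
    labels = [token_label(tok) for tok in text.split()]
    if "mixed" in labels:
        return "morpheme-level"  # at least one token mixes both scripts
    langs = {lab for lab in labels if lab in ("ar", "en")}
    if langs == {"ar", "en"}:
        return "word-level"  # whole words from both languages
    if langs == {"ar"}:
        return "monolingual-ar"
    if langs == {"en"}:
        return "monolingual-en"
    return "other"


if __name__ == "__main__":
    # Made-up examples, not corpus utterances.
    for example in ["أنا عندي meeting بكرة", "الproject ده", "hello world"]:
        print(utterance_type(example), "|", example)
```

A script-based heuristic like this would miss English words transliterated into Arabic script, and vice versa, so it is only a rough first pass over a transcript.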


Publications

  • Hamed, Injy, Ngoc Thang Vu, and Slim Abdennadher. "ArzEn: A speech corpus for code-switched Egyptian Arabic-English." In Proceedings of the 12th Language Resources and Evaluation Conference (2020).

  • Hamed, Injy, Pavel Denisov, Chia-Yu Li, Mohamed Elmahdy, Slim Abdennadher, and Ngoc Thang Vu. "Investigations on speech recognition systems for low-resource dialectal Arabic–English code-switching speech." Computer Speech & Language 72 (2022).

  • Hamed, Injy, Nizar Habash, Slim Abdennadher, and Ngoc Thang Vu. "ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic - English." In Proceedings of the 7th Arabic Natural Language Processing Workshop (2022).

  • Hamed, Injy, Nizar Habash, Slim Abdennadher, and Ngoc Thang Vu. "Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation." arXiv preprint arXiv:2205.12649 (2022).

  • Gaser, Marwa, Manuel Mager, Injy Hamed, Nizar Habash, Slim Abdennadher, and Ngoc Thang Vu. "Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text." arXiv preprint arXiv:2210.06990 (2022).

  • Hamed, Injy, Amir Hussein, Oumnia Chellah, Shammur Chowdhury, Hamdy Mubarak, Sunayana Sitaram, Nizar Habash, and Ahmed Ali. "Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition." In Proceedings of SLT (2022).