ZAEBUC زئـــــــــبق

Zayed Arabic-English Bilingual Undergraduate Corpus 

Mission: an open bilingual user corpus for research

The Zayed Arabic-English Bilingual Undergraduate Corpus (ZAEBUC) is a project focusing on bilingual users of Arabic and English; it comprises samples of their writing and speech in both languages. More than half the world’s population is estimated to use more than one language every day (Grosjean, 2010), and many of these people are literate to some level in more than one language. However, corpus-based research has tended to focus on one language at a time, with cross-linguistic comparisons made only between different communities. We, by contrast, work towards building integrated corpora that support research on bilingualism.

ZAEBUC Written Corpus

ZAEBUC Written Corpus is a bilingual writer corpus: it matches comparable texts in different languages written by the same writer on different occasions. It currently comprises short essays written by several hundred (mainly Emirati) freshman students, for a total of 388 English essays (~88,000 words) and 214 Arabic essays (~33,000 words).

The corpus is provided in uncorrected and corrected versions, so that errors in spelling and basic sentence grammar can be identified and analyzed. Both the Arabic and the English texts are also rated by three assessors using the Common European Framework of Reference (CEFR; Council of Europe, 2001). Additionally, the corpus is automatically and manually annotated for part-of-speech tags, lemmas and other features. We followed commonly used standards for tokenization, tagging and lemmatization for Arabic and English to allow the use of the corpus in computational research (Marcus et al., 1993; Maamouri et al., 2004). In particular, we used the Universal Dependencies part-of-speech standards, as they are designed to maximize comparability between languages (Nivre et al., 2016). Finally, metadata about each writer and text enables researchers to compare subcorpora. The corpus will be made available in a number of formats to accommodate different research communities’ needs, from basic TSV and TXT files to interfaces supported by SketchEngine (Kilgarriff et al., 2014).
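As a rough illustration of how token-level annotations in a TSV file might be used, the sketch below counts Universal Dependencies part-of-speech tags per document. The column names (doc_id, token, lemma, upos) and the sample rows are assumptions made up for this example; they are not the documented layout of the ZAEBUC release, so adjust them to the actual files.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: the column names and rows below are assumptions
# for illustration only, not the corpus's documented schema.
SAMPLE_TSV = (
    "doc_id\ttoken\tlemma\tupos\n"
    "EN-001\tStudents\tstudent\tNOUN\n"
    "EN-001\twrite\twrite\tVERB\n"
    "EN-001\tessays\tessay\tNOUN\n"
    "AR-001\tكتب\tكتب\tVERB\n"
    "AR-001\tالطالب\tطالب\tNOUN\n"
)


def upos_counts_by_doc(tsv_text):
    """Return {doc_id: Counter of UPOS tags} from tab-separated annotations."""
    # QUOTE_NONE keeps quotation-mark tokens from being treated as CSV quoting.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    counts = {}
    for row in reader:
        counts.setdefault(row["doc_id"], Counter())[row["upos"]] += 1
    return counts


counts = upos_counts_by_doc(SAMPLE_TSV)
for doc_id, tags in counts.items():
    print(doc_id, dict(tags))
```

Because the same tag set is used for both languages, per-document tag counts of this kind can be compared directly between a writer's Arabic and English essays.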

ZAEBUC Written Corpus will be an open research resource, aligned with the recent ‘multilingual turn’ in applied linguistics. It enables researchers to answer a range of questions such as “Do students who use more complex constructions in their Arabic writing also tend to use more complex constructions in their English writing?”; “Do male students use different vocabulary from female students when writing in Arabic? And is a similar pattern evident when writing in English?”; or “Is Arabic clearly dominant compared to English in students who studied at an Arabic-medium high school? And is the inverse pattern evident for students who studied at an English-medium high school?”.
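A question such as the first one above could be approached by pairing each writer's Arabic and English essays and correlating a per-essay measure across the two languages. The sketch below is only an outline under stated assumptions: the writer IDs, the language codes, and the numeric "complexity score" are invented placeholders, and the corpus does not prescribe any particular complexity measure. Writers lacking an essay in either language are left out of the pairing.

```python
from math import sqrt

# Hypothetical per-essay records: (writer_id, language, complexity_score).
# The IDs and scores are made-up placeholders, not corpus data.
ESSAYS = [
    ("w1", "ar", 12.0), ("w1", "en", 10.0),
    ("w2", "ar", 15.0), ("w2", "en", 14.0),
    ("w3", "ar", 9.0), ("w3", "en", 8.0),
    ("w4", "en", 11.0),  # no Arabic essay, so this writer is not paired
]


def paired_scores(essays):
    """Return (arabic_score, english_score) pairs for writers with both essays."""
    by_writer = {}
    for writer, lang, score in essays:
        by_writer.setdefault(writer, {})[lang] = score
    return [(s["ar"], s["en"]) for s in by_writer.values() if "ar" in s and "en" in s]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


pairs = paired_scores(ESSAYS)
r = pearson(pairs)
print(f"{len(pairs)} paired writers, r = {r:.3f}")
```

The same pairing step could be combined with writer metadata (for example gender or high-school medium of instruction) to compare subgroups, as in the other questions above.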


Download ZAEBUC Written Corpus

To download the corpus, click here.


Research Team

David M. Palfreyman (UAE University)

Nizar Habash (New York University Abu Dhabi, CAMeL Lab)

ZAEBUC Spoken Corpus

ZAEBUC Spoken Corpus is a multilingual, multidialectal Arabic-English speech corpus. The corpus comprises twelve hours of Zoom meetings involving multiple speakers role-playing a work situation where Students brainstorm ideas for a certain topic and then discuss it with an Interlocutor. The meetings cover different topics and are divided into phases with different language setups. The corpus is multilingual, including two languages (Arabic and English) with Arabic spoken in multiple variants (Modern Standard Arabic, Gulf Arabic, and Egyptian Arabic) and English used with various accents. Adding to the complexity of the corpus, there is also frequent code-switching between these languages and dialects. 

As part of our work, we draw on established transcription guidelines (Gadalla et al., 1997; Richey et al., 2019; Hamed et al., 2020) to present a set of guidelines that handle issues of conversational speech, code-switching and the orthography of both languages; we will make these guidelines available. The corpus includes: (1) the audio files, (2) manual transcriptions of the recordings, (3) dialectness level annotations for the portion containing code-switching between Arabic variants, and (4) automatic morphological annotations, including tokenization, lemmatization, and part-of-speech tagging. Because it consists of spontaneous conversational speech, the ZAEBUC Spoken Corpus offers a challenging test set for ASR systems, as well as an interesting setup for examining the interaction between diverse bilingual speakers.


Download ZAEBUC Spoken Corpus

To download the corpus, click here.


Research Team

Injy Hamed (MBZUAI)

Fadhl Eryani (University of Tübingen)

David M. Palfreyman (UAE University)

Nizar Habash (New York University Abu Dhabi, CAMeL Lab)

Acknowledgements

The creators of this corpus acknowledge the support of the Zayed University Research Incentive Fund (award R19068) for this project.

We also extend thanks to Ramy Eskander for helpful discussions and the team of annotators at Ramitechs for their help in creating this resource.


Publications

ZAEBUC: An Annotated Arabic-English Bilingual Writer Corpus. Nizar Habash & David Palfreyman. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pp. 79-88, Marseille, 2022.  [PDF]

Bilingual Writers and Corpus Analysis. David M. Palfreyman & Nizar Habash (Eds., 2022). Routledge. 

ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus. Injy Hamed, Fadhl Eryani, David Palfreyman, & Nizar Habash. In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 79-88, Torino, 2024.  [PDF]


References

Council of Europe (2001). Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge University Press.

Grosjean, F. (2010). Bilingual. Harvard University Press.

Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., ... & Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 1(1), 7-36.

Maamouri, M., Bies, A., Buckwalter, T., & Mekki, W. (2004, September). The Penn Arabic Treebank: Building a large-scale annotated Arabic corpus. In NEMLAR conference on Arabic language resources and tools (Vol. 27, pp. 466-467).

Marcus, M., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.

Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C. D., McDonald, R., et al. (2016). Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 1659-1666.

Gadalla, H., Kilany, H., Arram, H., Yacoub, A., El-Habashi, A., Shalaby, A., Karins, K., Rowson, E., MacIntyre, R., Kingsbury, P., Graff, D., & McLemore, C. (1997). CALLHOME Egyptian Arabic transcripts. Linguistic Data Consortium, Philadelphia.

Richey, C., D’Angelo, N. A. C., Bratt, H., & Shriberg, E. (2019). SRI speech-based collaborative learning corpus LDC2019S01. Web Download. Philadelphia: Linguistic Data Consortium.

Hamed, I., Vu, N. T., & Abdennadher, S. (2020). ArzEn: A speech corpus for code-switched Egyptian Arabic-English. In Proceedings of LREC, pp. 4237–4246.


