In the Bahrain Corpus, we aimed to create a specialized corpus of the Bahraini Arabic dialect that includes written texts as well as transcripts of audio files, drawn from different genres (folktales, comedy shows, plays, cooking shows, etc.).
The texts that comprise this corpus (whether originally written or transcribed from TV shows) range from retellings of folktales, songs, and recipes for traditional Bahraini dishes to TV series reenacting stories from older times and older ways of speaking, interviews with prominent Bahraini personalities, and texts that include social commentary.
At the time of this publication, the carefully curated corpus comprises 620K words. We also enrich the Bahrain Corpus text with automatic morphological annotations using state-of-the-art morphosyntactic disambiguation for Gulf Arabic, and we validate the quality of the annotations on a 7.6K-word sample. We make the full corpus as well as the annotated sample publicly available to support researchers interested in Arabic NLP.
Dana Abdulrahim (PI, University of Bahrain)
Latifa Shamsan (University of Bahrain)
Salam Khalifa (Stony Brook University)
Nizar Habash (New York University Abu Dhabi, CAMeL Lab)
We would like to thank all of the students and volunteers who helped us create this resource over the last few years. We would also like to especially acknowledge the generous contribution of Dr. Anisa Fakhroo, who provided the authors with one of her latest publications on Bahraini folktales to be included in the Bahrain Corpus.
The Bahrain Corpus: A Multi-genre Corpus of Bahraini Arabic. Dana Abdulrahim, Go Inoue, Latifa Shamsan, Salam Khalifa, and Nizar Habash. Proceedings of The 13th Language Resources and Evaluation Conference (LREC2022), pages 2345-2352, Marseille, France. 2022. European Language Resources Association (ELRA).