The MADAR Project

Multi-Arabic Dialect Applications and Resources

About

MADAR (Multi-Arabic Dialect Applications and Resources) was a three-year joint project among the NLP Group at Carnegie Mellon University in Qatar (CMU-Q), the Computational Approaches to Modeling Language (CAMEL) Lab at New York University Abu Dhabi (NYUAD), and Columbia University. The project also involved collaborators from the University of Bahrain (UoB). The project aimed at improving dialectal Arabic processing by:

  • Developing resources for Arabic dialect modeling, including the creation of a 25-city multi-dialect lexicon and a 25-city multi-dialect parallel corpus;

  • Developing machine translation systems between dialects, between dialects and English, and between dialects and Standard Arabic; and

  • Developing dialect identification systems that can work on a variety of granularity levels.

The MADAR Project is among the largest efforts to date, in both scale and depth, in natural language processing of Arabic dialects.

Members

Alexander Erdmann

Research Assistant

NYUAD

Fadhl Eryani

Research Assistant

NYUAD


Sabit Hassan

Research Associate

CMU-Q

Publications

  • A Spelling Correction Corpus for Multiple Arabic Dialects. Fadhl Eryani, Nizar Habash, Houda Bouamor, and Salam Khalifa. Proceedings of the 12th Language Resources and Evaluation Conference, 2020.

  • The MADAR Shared Task on Arabic Fine-Grained Dialect Identification. Houda Bouamor, Sabit Hassan and Nizar Habash. Proceedings of the Fourth Arabic Natural Language Processing Workshop, Florence, Italy, 2019.

  • A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance. Alexander Erdmann, Salam Khalifa, Mai Oudah, Nizar Habash and Houda Bouamor. Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, Florence, Italy, 2019.

  • ADIDA: Automatic Dialect Identification for Arabic. Ossama Obeid, Mohammad Salameh, Houda Bouamor and Nizar Habash. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota, 2019.

  • Addressing Noise in Multidialectal Word Embeddings. Alexander Erdmann, Nasser Zalmout and Nizar Habash. Proceedings of the Conference of the Association for Computational Linguistics (ACL), Melbourne, Australia, 2018.

  • Unified Guidelines and Resources for Arabic Dialect Orthography. Nizar Habash, Salam Khalifa, Fadhl Eryani, Owen Rambow, Dana Abdulrahim, Alexander Erdmann, Reem Faraj, Wajdi Zaghouani, Houda Bouamor, Nasser Zalmout, Sara Hassan, Faisal Al Shargi, Sakhar Alkhereyf, Basma Abdulkareem, Ramy Eskander, Mohammad Salameh and Hind Saddiki. The International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018.

  • Fine-Grained Arabic Dialect Identification. Mohammad Salameh, Houda Bouamor and Nizar Habash. Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), Santa Fe, New Mexico, USA, 2018.

  • MADARi: A Web Interface for Joint Arabic Morphological Annotation and Spelling Correction. Ossama Obeid, Salam Khalifa, Nizar Habash, Houda Bouamor, Wajdi Zaghouani and Kemal Oflazer. The International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018.

  • The MADAR Arabic Dialect Corpus and Lexicon. Houda Bouamor, Nizar Habash, Mohammad Salameh, Wajdi Zaghouani, Owen Rambow, Dana Abdulrahim, Ossama Obeid, Salam Khalifa, Fadhl Eryani, Alexander Erdmann and Kemal Oflazer. The International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018.

  • Low Resourced Machine Translation via Morpho-syntactic Modeling: The Case of Dialectal Arabic. Alexander Erdmann, Nizar Habash, Dima Taji and Houda Bouamor. Proceedings of the Machine Translation Summit, Nagoya, Japan, 2017.

Resources

MADAR Corpus

The MADAR Corpus is a collection of parallel sentences covering the dialects of 25 cities from the Arab World, in addition to English, French, and Modern Standard Arabic (MSA). The corpus was created by translating selected sentences from the English and French versions of the Basic Traveling Expression Corpus (BTEC) into the different dialects. The MADAR Corpus will be made available soon to the research community under a non-commercial license. While we only provide the Arabic portions of the corpus, the English parallel text can be acquired directly from the USTAR consortium.

We provide three downloadable versions of the corpus (see Downloads): a parallel corpus version, a version based on the 2019 MADAR Dialect ID Shared Task, and a version with CODA spelling (Conventional Orthography for Dialectal Arabic).

A searchable interface is provided below. (The interface may take a few seconds to load, so please be patient.)

MADAR Lexicon

The MADAR Lexicon is a collection of 1,045 concepts extracted from the MADAR Corpus. Each concept is defined by a triplet of words or phrases in English, French, and MSA, along with multiple equivalent dialectal forms covering 25 cities from the Arab World. Each dialectal form includes its CODA orthography and CAPHI phonology. The MADAR Lexicon will be made available soon to the research community under a non-commercial license.

A searchable interface is provided below. (The interface may take a few seconds to load, so please be patient.)

  • Dialectal Arabic words are written in the Conventional Orthography for Dialectal Arabic (CODA). The guidelines for CODA are here.

  • Pronunciation of Arabic words is provided in the CAMEL Arabic Phonetic Inventory (CAPHI), a simplified form of the International Phonetic Alphabet (IPA). The CAPHI guidelines are here.

  • If you have any suggestions or fixes for the lexicon, please take a moment to complete the suggestion form.
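The lexicon structure described above (a concept triplet plus per-city dialectal forms, each with a CODA spelling and a CAPHI pronunciation) can be pictured as a small data model. The following is only a sketch: the class and field names are hypothetical, the dialectal values are placeholders rather than real lexicon entries, and it does not reflect the actual file format of the MADAR Lexicon.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Concept:
    """A concept identified by an English / French / MSA triplet."""
    english: str
    french: str
    msa: str


@dataclass(frozen=True)
class DialectalForm:
    """One dialectal equivalent of a concept in a given city."""
    city: str
    coda: str   # spelling in CODA orthography
    caphi: str  # pronunciation in CAPHI


@dataclass
class LexiconEntry:
    """A concept together with all of its dialectal forms."""
    concept: Concept
    forms: List[DialectalForm] = field(default_factory=list)

    def forms_for(self, city: str) -> List[DialectalForm]:
        """Return every form recorded for one city (a city may have several)."""
        return [f for f in self.forms if f.city == city]

    def cities(self) -> List[str]:
        """Return the sorted list of cities that have at least one form."""
        return sorted({f.city for f in self.forms})


# Illustrative values only; nothing here is copied from the MADAR Lexicon.
entry = LexiconEntry(
    concept=Concept(english="book", french="livre", msa="كتاب"),
    forms=[
        DialectalForm(city="Cairo", coda="<CODA form 1>", caphi="<CAPHI form 1>"),
        DialectalForm(city="Cairo", coda="<CODA form 2>", caphi="<CAPHI form 2>"),
        DialectalForm(city="Beirut", coda="<CODA form 3>", caphi="<CAPHI form 3>"),
    ],
)

print(entry.cities())
print(len(entry.forms_for("Cairo")))
```

Storing the forms as a flat list, rather than one value per city, keeps room for the "multiple equivalent dialectal forms" that a single city can have for one concept.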

ADIDA

ADIDA is a system for automatic dialect identification for Arabic. It is based on the MADAR Corpus data. The ADIDA utility is part of the CAMeL Tools open-source Python toolkit.
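The project goal of dialect identification at "a variety of granularity levels" can be illustrated with a small sketch. It assumes that some city-level classifier has already produced one score per city, and it shows one simple way to coarsen those scores to the country level by summing them. The function names and scores below are hypothetical and made up for illustration; this is not ADIDA's implementation and not the CAMeL Tools API.

```python
from collections import defaultdict
from typing import Dict

# Real geography, but only a small subset of cities, chosen for illustration.
CITY_TO_COUNTRY = {
    "Cairo": "Egypt",
    "Alexandria": "Egypt",
    "Aswan": "Egypt",
    "Beirut": "Lebanon",
    "Damascus": "Syria",
    "Aleppo": "Syria",
}


def coarsen(city_scores: Dict[str, float], mapping: Dict[str, str]) -> Dict[str, float]:
    """Sum city-level scores into scores for the coarser labels in `mapping`."""
    coarse = defaultdict(float)
    for city, score in city_scores.items():
        coarse[mapping[city]] += score
    return dict(coarse)


def top_label(scores: Dict[str, float]) -> str:
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


# Made-up scores for a single sentence; a real system would compute these.
city_scores = {"Cairo": 0.5, "Alexandria": 0.25, "Beirut": 0.125, "Damascus": 0.125}
country_scores = coarsen(city_scores, CITY_TO_COUNTRY)

print(top_label(city_scores))     # fine-grained (city-level) decision
print(top_label(country_scores))  # coarse-grained (country-level) decision
```

Summing keeps the coarse scores consistent with the fine-grained ones: if the city scores form a probability distribution, so do the country scores.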

MADARi

To help collect morphological data for the project, we designed the MADAR Annotation Interface (MADARi). MADARi is a web-based framework consisting of a management interface that allows the lead annotator to upload documents and assign tasks to annotators, and an annotation interface that allows annotators to efficiently add morphological data to their assigned documents.