WANLP 2021: The Sixth Workshop for Arabic Natural Language Processing
Arabic has a wide variety of dialects, many of which remain under-studied, primarily due to lack of data. The goal of the Nuanced Arabic Dialect Identification (NADI) shared task is to alleviate this bottleneck by providing the community with diverse data from 21 Arab countries. The data can be used for modeling dialects, and NADI focuses on dialect identification. Dialect identification is the task of automatically detecting the source variety of a given text or speech segment. Previous work on Arabic dialect identification has focused on coarse-grained regional varieties such as Gulf or Levantine (e.g., Zaidan and Callison-Burch, 2013; Elfardy and Diab, 2013; Elaraby and Abdul-Mageed, 2018) or country-level varieties (e.g., Bouamor et al., 2018; Zhang and Abdul-Mageed, 2019) such as the MADAR shared task in WANLP 2019 (Bouamor, Hassan, and Habash, 2019). The MADAR shared task also involved city-level classification on human-translated data. Abdul-Mageed, Zhang, Elmadany, and Ungar (2020) also developed models for detecting city-level variation. NADI aims at maintaining this theme of modeling fine-grained variation.
NADI targets province-level dialects, and as such is the first to focus on naturally-occurring fine-grained dialect at the sub-country level. The NADI 2020 shared task was held with WANLP 2020 (Abdul-Mageed, Zhang, Bouamor, and Habash, 2020). The NADI 2021 shared task will be held with WANLP@EACL2021 and will continue to focus on fine-grained dialects with new datasets and efforts to distinguish both Modern Standard Arabic (MSA) and dialectal Arabic (DA) according to their geographical origin. The data covers a total of 100 provinces from all 21 Arab countries and comes from the Twitter domain. Evaluation and task setup follow the NADI 2020 shared task. The subtasks involved include:
Subtask 1.1: Country-level MSA identification: A total of 21,000 tweets, covering 21 Arab countries.
Subtask 1.2: Country-level DA identification: A total of 21,000 tweets, covering 21 Arab countries.
Subtask 2 is similar to Subtask 1 but focuses on the province level:
Subtask 2.1: Province-level MSA identification: A total of 21,000 tweets, covering 100 provinces.
Subtask 2.2: Province-level DA identification: A total of 21,000 tweets, covering 100 provinces.
Participants will also be provided with an additional 10M unlabeled tweets that can be used in developing their systems for either or both of the tasks.
The evaluation metrics will include precision, recall, F-score, and accuracy. Macro-averaged F-score will be the official metric.
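As an illustration of the official metric, the following is a minimal sketch of how a macro-averaged F-score can be computed from gold and predicted labels. It is not the official scoring script: the function name `macro_f1`, the toy labels, and the choice to average over every label that appears in either the gold or the predicted set are assumptions made for this example.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores.

    Assumption: classes are the union of labels seen in gold and pred;
    the official scorer may define the label set differently.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed

    f1_scores = []
    for label in sorted(set(gold) | set(pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)

    # Every class counts equally, regardless of how many tweets it has.
    return sum(f1_scores) / len(f1_scores)


# Toy example with hypothetical country labels.
gold = ["Egypt", "Egypt", "Iraq", "Oman"]
pred = ["Egypt", "Iraq", "Iraq", "Oman"]
print(round(macro_f1(gold, pred), 4))
```

Because each class contributes equally to the average, a system that performs well only on the most frequent countries or provinces is penalized more under this metric than under accuracy.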
Participating teams will be provided with a common training data set and a common development set. No external manually labelled data sets are allowed. A blind test data set will be used to evaluate the output of the participating teams. Each team is allowed a maximum of 3 submissions. All teams are required to report results on the development and test sets (after results are announced) in their write-ups.
The shared task evaluation will be hosted through CODALAB.
CODALAB link for NADI Shared Task Subtask 1.1: https://competitions.codalab.org/competitions/27768
CODALAB link for NADI Shared Task Subtask 1.2: https://competitions.codalab.org/competitions/27769
CODALAB link for NADI Shared Task Subtask 2.1: https://competitions.codalab.org/competitions/27770
CODALAB link for NADI Shared Task Subtask 2.2: https://competitions.codalab.org/competitions/27771
The train, development, and test (unlabelled) datasets have already been released to registered participants via email. The evaluation stage is over, but you can still score your system on CodaLab in the post-evaluation phase.
By downloading the NADI-2021 Shared Task files from HERE, you agree to the terms of the license.
December 15, 2020: Release of training data and scoring script
December 27, 2020: Registration deadline
December 28, 2020: Test set made available
February 7, 2021 (extended from February 4, 2021): Codalab TEST system submission deadline
February 9, 2021 (extended from February 5, 2021): Shared task system paper submissions due
February 20, 2021 (extended from February 10, 2021): Notification of acceptance
February 28, 2021 (extended from February 15, 2021): Camera-ready version of shared task system papers due (strict!)
April 19, 2021: Workshop date
Note: All deadlines are 11:59 PM UTC-12:00 (Anywhere On Earth).
For any questions related to this task, please contact the organizers directly at the following email address: ubc.nadi2020@gmail.com, or join the Google group: https://groups.google.com/d/forum/nadi_shared_task.
Muhammad Abdul-Mageed, Chiyu Zhang, Abdelrahim Elmadany (The University of British Columbia, Canada), Nizar Habash (New York University Abu Dhabi), and Houda Bouamor (Carnegie Mellon University, Qatar).