The AMIYA shared task will offer a chance for researchers to demonstrate innovations and improvements in language modeling of dialectal Arabic.
❗Don't forget to register your team using the registration form. ❕
Arabic is centrally relevant to any discussion of dialects and related language varieties. Some regard Arabic as a single language with a vast diversity of dialects, while others regard it as a clade of distinct but related languages. Either way, Arabic language varieties are spoken by over 400 million people.
Standard LLMs are typically much more proficient in Modern Standard Arabic (MSA) than in Dialectal or Colloquial Arabic (DA). Because Colloquial Arabic varieties have far fewer computational resources than MSA and other high-resource languages, building LLMs that support DA has become a recent focus of the research community. We present the first shared task for Dialectal Arabic Language Modeling: Arabic Modeling In Your Accent (AMIYA).
In the task, we will ask participants to contribute LLMs trained or adapted for DA. These will be evaluated using the AL-QASIDA benchmark (Robinson et al., 2025), an evaluation suite that comprehensively measures an LLM's dialectal fidelity, comprehension, generation quality, and handling of MSA-DA diglossia.
We are accepting submissions in three tracks: (1) closed data, (2) closed models, and (3) open. We are officially accepting submissions for Arabic varieties from the following countries:
Morocco
Egypt
Palestine
Syria
Saudi Arabia
Teams who wish to create systems for additional varieties may seek approval by contacting the task organizers. Below we detail the different submission tracks.
In the closed-data track, teams may use any fully open-source LLM as the basis for their Dialectal Arabic LLMs. However, they may only fine-tune these LLMs on the training data that we will provide (by the end of November).
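As a rough illustration (not an official recipe), closed-data fine-tuning might look like the sketch below. The file name amiya_train.txt, the base model, and the hyperparameters are placeholders; confirm the eligibility of any base model with the organizers.

```python
# Minimal closed-data fine-tuning sketch (illustrative only).
# Assumes the official training data arrives as a plain-text file of
# dialectal sentences, one example per line ("amiya_train.txt" is a placeholder).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen2.5-0.5B"  # example open model; check eligibility with the organizers
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Treat each line of the provided file as one training example.
data = load_dataset("text", data_files={"train": "amiya_train.txt"})
tokenized = data["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="amiya-ft", num_train_epochs=1,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("amiya-ft")
```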
In the closed-model track, teams cannot use pre-trained LLMs and must train their LLMs from scratch. (Using the model config from an existing model with a random initialization is acceptable; see the sketch below.) However, they may use any data sources for training, in addition to the data we provide, with the exception of the off-limits datasets listed below.
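For instance, the permitted "existing config, random initialization" option might look like the following sketch; the config source is just an example, and teams should confirm with the organizers whether reusing a pre-trained tokenizer is allowed.

```python
# Sketch of the permitted "existing config, random initialization" option.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

config = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B")  # architecture hyperparameters only
model = AutoModelForCausalLM.from_config(config)                # weights are randomly initialized
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # tokenizer reuse: check with organizers

print(f"{sum(p.numel() for p in model.parameters()):,} randomly initialized parameters")
```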
In the open track, teams may use any pre-trained models and data sources (again, with the exception of the off-limits datasets) to develop their DA LLMs.
Use of the following datasets for system training or development is not permitted, as they may be included in our evaluation:
Any eval data used for the PalmX shared task
Any FLORES devtest data
Any MADAR-26 data that is part of the corpus-6-test-corpus-26-test split
Please do not use the following datasets without first checking with the task organizers:
⚠️ Additionally, please use the datasets already included in AL-QASIDA only for dev / tuning by default, and NOT for training. Note that the AL-QASIDA repo has been updated for dev use (i.e., to avoid the off-limits datasets listed above).
To submit systems for evaluation, teams will be required to upload a model to HuggingFace and send the HuggingFace link to the task organizers.
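For example, pushing a checkpoint to the Hub might look like the sketch below; the repo id is a placeholder each team chooses, and teams should check with the organizers whether gated or private repos are acceptable.

```python
# Illustrative submission upload; run `huggingface-cli login` (or set HF_TOKEN) first.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "amiya-ft"  # path to your final checkpoint (placeholder)
model = AutoModelForCausalLM.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

repo_id = "your-team/amiya-egyptian-arabic"  # placeholder repo id
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
# Then send https://huggingface.co/<repo_id> to the task organizers.
```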
We will evaluate submissions using the AL-QASIDA benchmark and will recognize teams that can maximize any of the following metrics:
ADI2 dialect fidelity score, on monolingual, cross-lingual, and translation prompts (see Robinson et al., 2025)
chrF++ translation score, on DA-to-English, English-to-DA, DA-to-MSA, and MSA-to-DA translation (see the chrF++ sketch below)
Human scores for fluency and adherence to DA instructions
The baseline model for comparison will be Llama-3.1 (8B).
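As a rough illustration of the chrF++ metric (not the official scoring script), sacrebleu's chrF implementation with word_order=2 corresponds to chrF++; the strings below are placeholders.

```python
# Hypothetical chrF++ computation with sacrebleu (word_order=2 gives chrF++).
import sacrebleu

hypotheses = ["the weather in cairo is hot today"]    # system outputs (placeholder)
references = [["the weather is hot in cairo today"]]  # one reference stream (placeholder)

score = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)
print(score.score)
```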
To register as AMIYA shared task participants, please submit this form.
30 November, 2025: Release of official training data and scaffold code for a minimal submission
15 December, 2025: Registration deadline, eval data finalized
10 January, 2026: System submission deadline
20 January, 2026: System description paper submission deadline
TBD: Camera-ready paper due
24-29 March, 2026: VarDial workshop at EACL in Rabat, Morocco
Nathaniel R. Robinson (Johns Hopkins University)
Shahd Abdelmoneim (Cohere Labs Community)
Kelly Marchisio (Cohere)
Anjali Kantharuban (Carnegie Mellon University)
Kenton Murray (Johns Hopkins University)
Teams are encouraged to use whatever means necessary to improve model performance, including any of the following:
Prompt engineering
Fine-tuning
Training from scratch
Because prompt engineering is the lowest-cost strategy, we will provide scaffold code that, once tailored to a specific DA variety, constitutes a minimally acceptable submission; it will focus on prompt-engineering techniques for improving dialectal modeling. We expect, however, that model training or fine-tuning will lead to more competitive submissions.
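As a rough sketch of that prompt-engineering approach (not the official scaffold; the model, wording, and target dialect below are placeholders), one could steer an instruction-tuned model toward a target dialect with a system prompt. Recent versions of transformers accept chat-format input directly in the text-generation pipeline.

```python
# Hypothetical prompt-engineering baseline: a dialect-targeting system prompt.
from transformers import pipeline

chat = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system",
     "content": ("You are a helpful assistant. Always answer in Egyptian Colloquial "
                 "Arabic (Masri), never in Modern Standard Arabic.")},
    {"role": "user", "content": "How do I get from Ramses Station to Tahrir Square?"},
]
out = chat(messages, max_new_tokens=200)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```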
[Coming soon: scaffold code, tutorial resources]