The Fifth Arabic Natural Language Processing Workshop

(WANLP 2020)

co-located with COLING'2020, Barcelona, Spain, 13-18 Sep. 2020

Workshop Description

Arabic is a challenging language for the field of computational linguistics. This is due to many factors, including its complex and rich morphology, its high degree of ambiguity, and the presence of a number of dialects that vary quite widely. Arabic is also a language with important geopolitical connections. It is spoken by over 400 million people in countries with varying degrees of prosperity and stability. It is the primary language of the latest world refugee problem affecting the Middle East and Europe. The opportunities made possible by working on this language and its dialects should not be underestimated in their consequences for the Arab World, the Mediterranean Region, and the rest of the world.

There has been a lot of progress in the last 20 years in the area of Arabic Natural Language Processing (NLP). Many Arabic NLP (or Arabic NLP-related) workshops and conferences have taken place, both in the Arab World and in association with international conferences. Examples include the following:

    • The First, Second, Third, and Fourth Arabic Natural Language Processing Workshops at EMNLP 2014, ACL 2015, EACL 2017, and ACL 2019, respectively.
    • The First, Second, and Third Workshops on Arabic Corpora and Processing Tools at LREC 2014, LREC 2016, and LREC 2018, respectively.
    • The conference on Arabic Language Resources and Tools (MEDAR-2009, NEMLAR-2004).
    • The workshop on Computational Approaches to Semitic Languages (LREC 2010, EACL 2009, ACL 2007, ACL 2005, ACL 2002, ACL 1998).
    • The workshop on Computational Approaches to Arabic Script-based Languages (MTSummit XII 2009, LSA 2007, COLING 2004).
    • The International Symposium on Computer and Arabic Language (ISCAL 2009, ISCAL 2007).

This workshop follows in the footsteps of these efforts to provide a forum for researchers to share and discuss their ongoing work. This workshop is timely given the continued rise in research projects focusing on Arabic NLP.

We invite submissions on topics that include, but are not limited to, the following:

    • Basic core technologies: morphological analysis, disambiguation, tokenization, POS tagging, named entity detection, chunking, parsing, semantic role labeling, sentiment analysis, Arabic dialect modeling, etc.
    • Applications: machine translation, speech recognition, speech synthesis, optical character recognition, pedagogy, assistive technologies, social media, etc.
    • Resources: dictionaries, annotated data, corpora, etc.

Submissions may include work in progress as well as finished work. Submissions must have a clear focus on specific issues pertaining to the Arabic language whether it is standard Arabic, dialectal, or mixed. Papers on other languages sharing problems faced by Arabic NLP researchers such as Semitic languages or languages using Arabic script are welcome. Additionally, papers on efforts using Arabic resources but targeting other languages are also welcome. Descriptions of commercial systems are welcome, but authors should be willing to discuss the details of their work.

Shared Task

Associated with the workshop will be a shared task on Arabic dialect identification. This shared task targets province-level dialects, and as such will be the first to focus on naturally-occurring fine-grained dialect at the sub-country level.

Shared Task Webpage: https://sites.google.com/view/nadi-shared-task


Paper Submission Instructions

Paper Length: Submissions are expected to be up to 8 pages long plus any number of pages for references. Final versions of long papers will be given one additional page of content (up to 9 pages) so that reviewers’ comments can be taken into account.

Submission Format: Submissions must be in PDF and prepared using LaTeX. The template will be provided in due time.

Submission Website: Submissions are made via Softconf. The link will be provided in due time.

Blind Reviewing Policy: The workshop follows a blind reviewing policy. The authors should omit their names and affiliations from the paper and avoid self-references that reveal their identity. Papers that do not conform to these requirements will be rejected without review.

Multiple Submission Policy: Papers that have been or will be submitted to other meetings or publications must indicate this at submission time. Authors must inform the organizers immediately once a paper is to be withdrawn from the workshop for any reason. Attempting to publish the same paper, or one with major overlap (50%), may lead to rejection of the paper even after an acceptance notification has gone out.

Anonymity and Supplementary Material: As the reviewing will be blind, papers must not include authors' names and affiliations. Furthermore, self-references that reveal the author's identity, e.g., "We previously showed (Smith, 1991) ..." must be avoided. Instead, use citations such as "Smith previously showed (Smith, 1991) ..." Papers that do not conform to these requirements will be rejected without review.

Papers should not refer, for further detail, to documents that are not available to the reviewers. For example, do not omit or redact important citation information to preserve anonymity. Instead, use third person or named reference to this work, as described above (“Smith showed” rather than “we showed”).

Papers may be accompanied by a resource (software and/or data) described in the paper. Papers that are submitted with accompanying software/data may receive additional credit toward the overall evaluation, and the potential impact of the software and data will be taken into account when making the acceptance/rejection decisions.

WANLP 2020 also encourages the submission of supplementary material to report preprocessing decisions, model parameters, and other details necessary for the replication of the experiments reported in the paper. Seemingly small preprocessing decisions can sometimes make a large difference in performance, so it is crucial to record such decisions to precisely characterize state-of-the-art methods.

Nonetheless, supplementary material should be supplementary (rather than central) to the paper. It may include explanations or details of proofs or derivations that do not fit into the paper, lists of features or feature templates, sample inputs and outputs for a system, pseudo-code or source code, and data. The paper should not rely on the supplementary material: while the paper may refer to and cite the supplementary material and the supplementary material will be available to reviewers, they will not be asked to review or even download the supplementary material. Authors should refer to the contents of the supplementary material in the paper submission, so that reviewers interested in these supplementary details will know where to look.

Note: The supplementary material does not count towards the page limit and should not be included in the paper; it should be submitted separately using the appropriate field on the submission website.

Important Dates

  • Dec 1, 2019: First Call for Workshop Papers
  • Mar 18, 2020: Second Call for Workshop Papers
  • May 20, 2020: Workshop Paper Due Date
  • Jun 24, 2020: Notification of Acceptance
  • Jul 11, 2020: Camera-ready Papers Due
  • Sep 13, 2020: Workshop Date

Invited Speaker

Dr. Nizar Habash from the Computer Science department at New York University Abu Dhabi has agreed to be the keynote speaker at the workshop. He will be talking about the latest research and advances in Arabic dialect processing.

Workshop Organizers

General Chair:

      • Imed Zitouni, Google, USA. Email: imed.zitouni AT gmail.com

Program Chairs:

      • Muhammad Abdul-Mageed, UBC, Canada. Email: muhammad.mageed AT ubc.ca
      • Houda Bouamor, Carnegie Mellon University in Qatar. Email: hbouamor AT qatar.cmu.edu
      • Fethi Bougares, University of Le Mans, France. Email: fethi.bougares AT univ-lemans.fr
      • Mahmoud El-Haj, Lancaster University, England. Email: m.el-haj AT lancaster.ac.uk

Publication Chair:

      • Nadi Tomeh, LIPN, Université Paris 13, Sorbonne Paris Cité. Email: tomeh AT lipn.fr

Publicity Chair:

      • Wajdi Zaghouani, Hamad Bin Khalifa University, Qatar. Email: wzaghouani AT hbku.edu.qa

Ex-General Chair / Advisor:

      • Wassim El-Hajj, American University of Beirut, Lebanon. Email: we07 AT aub.edu.lb

Advisory Committee:

      • Muhammad Abdul-Mageed, UBC, Canada. Email: muhammad.mageed AT ubc.ca
      • Ahmed Ali, Qatar Computing Research Institute, Qatar. Email: amali AT qf.org.qa
      • Hend Alkhalifa, King Saud University, Saudi Arabia. Email: hend.alkhalifa AT gmail.com
      • Houda Bouamor, Carnegie Mellon University in Qatar. Email: hbouamor AT qatar.cmu.edu
      • Fethi Bougares, Le Mans University, France. Email: Fethi.bougares AT gmail.com
      • Khalid Choukri, ELDA, European Language Resource Association, France. Email: choukri AT elda.org
      • Kareem Darwish, Qatar Computing Research Institute, Qatar. Email: kdarwish AT hbku.edu.qa
      • Mona Diab, George Washington University, USA. Email: mtdiab AT gmail.com
      • Mahmoud El-Haj, Lancaster University, UK. Email: m.el-haj AT lancaster.ac.uk
      • Samhaa El-Beltagy, Nile University, Egypt. Email: samhaaelbeltagy AT gmail.com
      • Wassim El-Hajj, American University of Beirut, Lebanon. Email: we07 AT aub.edu.lb
      • Nizar Habash, New York University Abu Dhabi, UAE. Email: nizar.habash AT nyu.edu
      • Lamia Hadrich Belguith, University of Sfax, Tunisia. Email: lamia.belguith AT gmail.com
      • Hazem Hajj, American University of Beirut, Lebanon. Email: hh63 AT aub.edu.lb
      • Walid Magdy, University of Edinburgh, Scotland. Email: wmagdy AT inf.ed.ac.uk
      • Khaled Shaalan, The British University in Dubai, UAE. Email: khaled.shaalan AT buid.ac.ae
      • Kamel Smaili, University of Lorraine, France. Email: kamel.smaili AT loria.fr
      • Nadi Tomeh, University Paris 13, France. Email: tomeh AT lipn.fr
      • Wajdi Zaghouani, Hamad Bin Khalifa University, Qatar. Email: wajdiz AT gmail.com
      • Imed Zitouni, Google, USA. Email: imed.zitouni AT gmail.com

Program Committee Members

The following is the list of PC members, all of whom participated in the review process of WANLP 2019. A large percentage of them have confirmed their willingness to review papers for WANLP 2020.

  1. Ahmed Abdelali, Qatar Computing Research Institute, Qatar
  2. Muhammad Abdul-Mageed, UBC, Canada
  3. Haithem Afli, Cork Institute of Technology, Ireland
  4. Ahmad Al Sallab, Faculty of Engineering, Cairo University
  5. Ahmed Ali, Qatar Computing Research Institute, Qatar
  6. Hend Alkhalifa, King Saud University, Saudi Arabia
  7. Chafik Aloulou, Université de Sfax, Tunisia
  8. Areeb Alowisheq, Imam University, Saudi Arabia
  9. Almoataz Al-Said, Cairo University, Egypt
  10. Nora Al-Twairesh, King Saud University, Saudi Arabia
  11. Salha Alzahrani, Taif University, Saudi Arabia
  12. Walid Aransa, Université du Maine, Le Mans, France
  13. Mohammed Attia, George Washington University
  14. Gilbert Badaro, American University of Beirut, Lebanon
  15. Alberto Barrón-Cedeño, Qatar Computing Research Institute, Qatar
  16. Abdelmajid Ben-Hamadou, University of Sfax, Tunisia
  17. Houda Bouamor, Carnegie Mellon University in Qatar
  18. Fethi Bougares, Le Mans University, France
  19. Karim Bouzoubaa, Mohammad V University, Morocco
  20. Tim Buckwalter, University of Maryland, USA
  21. Violetta Cavalli-Sforza, Al Akhawayn University, Morocco
  22. Khalid Choukri, ELDA, European Language Resource Association, France
  23. Kareem Darwish, Qatar Computing Research Institute, Qatar
  24. Abeer Dayel, King Saud University, Saudi Arabia
  25. Mona Diab, George Washington University, USA
  26. Joseph Dichy, Université Lyon 2, France
  27. Mahmoud El-Haj, Lancaster University, UK
  28. Shady Elbassuoni, American University of Beirut, Lebanon
  29. Wassim El-Hajj, American University of Beirut, Lebanon
  30. Ali Elkahky, Google AI
  31. Mariem Ellouze, University of Sfax, Tunisia
  32. AbdelRahim Elmadany, Jazan University, Saudi Arabia
  33. Mohamed Elmahdy, Qatar University, Qatar
  34. Tamer Elsayed, Qatar University, Qatar
  35. Ossama Emam, IBM, USA
  36. Ramy Eskander, Columbia University, USA
  37. Aly Fahmy, Cairo University, Egypt
  38. Ali Farghaly, Monterey Peninsula College, USA
  39. Bilel Gargouri, University of Sfax, Tunisia
  40. Sahar Ghannay, LIUM Laboratory, France
  41. Nada Ghneim, Higher Institute for Applied Sciences and Technology, Syria
  42. Nizar Habash, New York University Abu Dhabi, UAE
  43. Bassam Haddad, University of Petra, Jordan
  44. Lamia Hadrich Belguith, University of Sfax, Tunisia
  45. Hazem Hajj, American University of Beirut, Lebanon
  46. Salwa Hamada, Cairo University, Egypt
  47. Maram Hasanain, Qatar University, Qatar
  48. Mustafa Jarrar, Bir Zeit University, Palestine
  49. Shahram Khadivi, Tehran Polytechnic, Iran
  50. Mohamed Maamouri, Linguistic Data Consortium, USA
  51. Walid Magdy, University of Edinburgh, Scotland
  52. Azzeddine Mazroui, University Mohamed I, Morocco
  53. Seif Mechti, University of Sfax, Tunisia
  54. Salima Medhaffar, Le Mans University, France
  55. Karine Megerdoomian, The MITRE Corporation, USA
  56. Emad Mohamed, Suez Canal University, Egypt
  57. Ghassan Mourad, Lebanese University, Lebanon
  58. Hamdy Mubarak, Qatar Computing Research Institute, Qatar
  59. Preslav Nakov, Qatar Computing Research Institute, Qatar
  60. Alexis Nasr, University of Marseille, France
  61. Abdelsalam Nwesri, University of Tripoli, Libya
  62. Owen Rambow, Columbia University, USA
  63. Mohsen Rashwan, RDI Egypt
  64. Eshrag Refaee, Jazan University, Saudi Arabia
  65. Mohammad Salameh, Carnegie Mellon University, Qatar
  66. Hassan Sawaf, eBay Inc., USA
  67. Khaled Shaalan, The British University in Dubai, UAE
  68. Khaled Shaban, Qatar University, Qatar
  69. Otakar Smrž, Institute of Formal and Applied Linguistics, Charles University in Prague, Czech Republic
  70. Reem Suwaileh, Qatar University, Qatar
  71. Nadi Tomeh, University Paris 13, France
  72. Omar Trigui, University of Sousse, Tunisia
  73. Stephan Vogel, Qatar Computing Research Institute, Qatar
  74. Samantha Wray, Qatar Computing Research Institute, Qatar
  75. Wajdi Zaghouani, Hamad Bin Khalifa University, Qatar
  76. Taha Zerrouki, University of Bouira, Algeria
  77. Imed Zitouni, Microsoft Research, USA
  78. Ines Zribi, Sfax University, Tunisia

Shared Task: Nuanced Arabic Dialect Identification (NADI)

Introduction: Arabic has a widely varying collection of dialects. Many of these dialects remain under-studied due to the scarcity of resources. The goal of the shared task is to alleviate this bottleneck in the context of fine-grained Arabic dialect identification. Dialect identification is the task of automatically detecting the source variety of a given text or speech segment. Previous work on Arabic dialect identification has focused on coarse-grained regional varieties such as Gulf or Levantine (e.g., Zaidan and Callison-Burch, 2013; Elfardy and Diab, 2013) or on country-level varieties, as in the MADAR shared task at WANLP 2019 (Bouamor, Hassan, and Habash, 2019). The MADAR shared task also involved city-level classification on human-translated data. This shared task targets province-level dialects, and as such will be the first to focus on naturally-occurring fine-grained dialect at the sub-country level. The data covers a total of 100 provinces from all 22 Arab countries and comes from the Twitter domain. The evaluation and task setup follow those of the MADAR 2019 shared task. The shared task comprises the following subtasks:

    • Subtask 1: Country-level dialect identification. A total of 22,000 tweets, covering all 22 Arab countries. This is a new dataset created for this shared task.
    • Subtask 2: Province-level dialect identification. A total of 22,000 tweets, covering 100 provinces from all 22 Arab countries. This is the same dataset as in Subtask 1, but with province labels.

Unlabeled data: Participants will also be provided with an additional 10M unlabeled tweets that can be used in developing their systems for either or both of the tasks.

Metrics: The evaluation metrics will include precision, recall, F-score, and accuracy. The macro-averaged F-score will be the official metric.
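
For participants who want to check their development-set results locally, the following is a minimal sketch of how a macro-averaged F-score can be computed: the F1 score is calculated separately for each class and then averaged with equal weight per class, so rare dialects count as much as frequent ones. This is an illustration only and not the official scorer; the function name macro_f1, the country labels in the example, and the choice to average over every label seen in either the gold or predicted list are assumptions.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 (illustrative, not the official scorer)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed

    # Assumption: average over every label seen in gold or pred.
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2*TP / (2*TP + FP + FN); treated as 0 if the denominator is 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    gold = ["Egypt", "Egypt", "Iraq", "Morocco"]
    pred = ["Egypt", "Iraq", "Iraq", "Morocco"]
    print(round(macro_f1(gold, pred), 4))
```

In this toy example the per-class F1 values are 2/3 for Egypt, 2/3 for Iraq, and 1 for Morocco, so the macro average is 7/9.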

Participating teams will be provided with a common training data set and a common development set. No external manually labelled data sets are allowed. A blind test data set will be used to evaluate the output of the participating teams. All teams are required to report results on both the development and test sets in their write-ups.

The shared task will be hosted through CODALAB. Teams will be provided with a CODALAB link for each shared task.


Organizers: Muhammad Abdul-Mageed, Chiyu Zhang (The University of British Columbia, Canada), Nizar Habash (New York University Abu Dhabi), and Houda Bouamor (Carnegie Mellon University, Qatar).