The SAMER Project
Simplification of Arabic Masterpieces for Extensive Reading
About
The main objective of the SAMER project is to create standards and tools for simplifying modern Arabic fiction for school-age learners. The project first developed a five-level prototypical readability scale. It then produced a curated Arabic readability list from a general-purpose corpus of Arabic (half news and half fiction), graded the list on that scale based on frequency of occurrence in the corpus, and had it manually annotated in triplicate by language professionals from three dialectal regions of the Arab world. In the next stage, the project drew on this readability list to design and publish a 36k-word Readability-leveled Thesaurus for Arabic and to build a Simplification Interface platform as an extension to Google Docs. In the last stage, the system was used to simplify fifteen 10k-word texts from masterpieces of Arabic fiction, producing a readability-graded corpus of modern Arabic fiction.
The SAMER Project was funded by a New York University Abu Dhabi (NYUAD) Research Enhancement Fund.
Members
Bashar Alhafni, Graduate Research Assistant, NYUAD
Hind Saddiki, Research Assistant, NYUAD
Zhengyang Jiang, Undergraduate Research Assistant, NYUAD
Reem Hazim, Undergraduate Research Assistant, NYUAD
Juan Piñeros Liberato, Undergraduate Research Assistant, NYUAD
Publications & Presentations
Publications
Al Khalil, M., Habash, N., & Saddiki, H. (2017). Simplification of Arabic Masterpieces for Extensive Reading: A Project Overview. In Proceedings of the International Conference on Arabic Computational Linguistics (ACLing 2017), Dubai, United Arab Emirates. [PDF]
Al Khalil, M., Saddiki, H., Habash, N., & Alfalasi, L. (2018). A Leveled Reading Corpus of Modern Standard Arabic. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC-2018). [PDF]
Saddiki, H., Habash, N., Cavalli-Sforza, V., & Al Khalil, M. (2018). Feature Optimization for Predicting Readability of Arabic L1 and L2. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications. [PDF]
Al Khalil, M., Habash, N., & Jiang, Z. (2020). A Large-Scale Leveled Readability Lexicon for Standard Arabic. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France. [PDF]
Jiang, Z., Habash, N., & Al Khalil, M. (2020). An Online Readability Leveled Arabic Thesaurus. In Proceedings of the 27th International Conference on Computational Linguistics. [PDF]
Hazim, R., Saddiki, H., Alhafni, B., Al Khalil, M., & Habash, N. (2022). Arabic Word-level Readability Visualization for Assisted Text Simplification. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Abu Dhabi, United Arab Emirates. [PDF]
Alhafni, B., Hazim, R., Piñeros Liberato, J., Al Khalil, M., & Habash, N. (2024). The SAMER Arabic Text Simplification Corpus. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Turin, Italy. [PDF]
Presentations
Habash, N., Al Khalil, M., Saddiki, H., Jiang, Z., Hazim, R., & Alhafni, B. (2023). Arabic Automatic Readability Resources in the SAMER Project. The 1st Workshop on Readability for Low Resourced Languages. [PDF]
Resources
SAMER Readability Lexicon
The SAMER readability lexicon is a large-scale 36,000-lemma leveled readability lexicon for Modern Standard Arabic. The lexicon was manually annotated in triplicate by language professionals from three regions in the Arab world.
Download the SAMER Readability Lexicon here.
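As a rough illustration of what annotating in triplicate on a five-level scale implies for each lemma, the sketch below combines three annotators' levels into one. The middle-value (median) rule, the function name `combine_levels`, and the placeholder lemmas are assumptions made for this example only; the text above does not state how the project actually reconciled its annotations.

```python
from statistics import median

def combine_levels(annotations):
    """Combine three 1-5 readability levels into one (assumed rule: median)."""
    if len(annotations) != 3:
        raise ValueError("expected exactly three annotations")
    if any(level not in range(1, 6) for level in annotations):
        raise ValueError("levels must be integers from 1 to 5")
    # With three integer values, the median is simply the middle one.
    return median(annotations)

# Invented placeholder data, not taken from the SAMER lexicon.
toy_annotations = {
    "lemma_a": (1, 1, 2),
    "lemma_b": (3, 5, 4),
}

toy_lexicon = {
    lemma: combine_levels(levels)
    for lemma, levels in toy_annotations.items()
}
```

A median keeps a single outlying judgment from shifting the final level, which is one reason it is a common choice for three-way annotation; other rules such as a rounded mean would be equally easy to swap in.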
SAMER Readability Leveled Arabic Thesaurus
The SAMER Readability Leveled Arabic Thesaurus is an online thesaurus that integrates several Arabic natural language processing tools and databases to provide rich lexical and readability information for a given Arabic word.
Access the online thesaurus here.
SAMER Google Doc Add-On
We built the SAMER Google Doc Add-On to support the simplification of Arabic masterpieces. This tool helps human annotators identify the reading difficulty of an Arabic text by visualizing the word-level readability of the text in a Google Doc.
Install the SAMER Google Doc Add-On by following the instructions here.
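The idea of word-level readability visualization can be sketched as a lexicon lookup: each word receives a level, and words above the target level are flagged for the annotator. This is a minimal sketch and not the add-on's actual code. The function names, the placeholder lexicon, the assumption that input tokens are already lemmatized, and the choice to treat unknown words as the hardest level are all hypothetical.

```python
# Invented placeholder lexicon mapping lemmas to levels 1 (easiest) to 5 (hardest).
TOY_LEXICON = {"lemma_a": 1, "lemma_b": 4, "lemma_c": 2}

def level_tokens(tokens, lexicon, default_level=5):
    """Pair each token with its level; unknown tokens get default_level."""
    return [(token, lexicon.get(token, default_level)) for token in tokens]

def flag_difficult(tokens, lexicon, target_level, default_level=5):
    """Return the tokens whose level is above the reader's target level."""
    return [
        token
        for token, level in level_tokens(tokens, lexicon, default_level)
        if level > target_level
    ]

# Assumes the text has already been tokenized and lemmatized.
tokens = ["lemma_a", "lemma_b", "lemma_x"]
leveled = level_tokens(tokens, TOY_LEXICON)
to_simplify = flag_difficult(tokens, TOY_LEXICON, target_level=3)
```

In this sketch, `leveled` is what a visualization layer would color by level, and `to_simplify` lists the words a human annotator might consider replacing for a level-3 reader.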
SAMER Simplification Corpus
The SAMER Simplification Corpus is the first manually annotated Arabic parallel corpus for text simplification targeting school-age learners. The corpus comprises 159K words of text selected from 15 publicly available Arabic fiction novels, most of which were published between 1865 and 1955. It includes readability level annotations at both the document and word levels, as well as two simplified parallel versions of each text, targeting learners at two different readability levels.
Download the SAMER Simplification Corpus here.