The SAMER Project

Simplification of Arabic Masterpieces for Extensive Reading 


The main objective of the SAMER project is to create standards and tools for simplifying modern Arabic fiction for school-age learners. The project first developed a five-level prototypical readability scale. It then compiled a curated Arabic readability list from a general-purpose Arabic corpus (half news, half fiction), graded the list on the scale according to frequency of occurrence in the corpus, and had it manually annotated in triplicate by language professionals from three dialectal regions of the Arab world. In the next stage, the project drew on this readability list to design and publish a 36k-word Readability-leveled Thesaurus for Arabic and to build a Simplification Interface platform as an extension to Google Docs. In the last stage, the system was used to simplify fifteen 10k-word texts from masterpieces of Arabic fiction, producing a readability-graded corpus of modern Arabic fiction.

The SAMER Project was funded by a New York University Abu Dhabi (NYUAD) Research Enhancement Fund.


Publications & Presentations




SAMER Readability Lexicon

The SAMER readability lexicon is a large-scale 36,000-lemma leveled readability lexicon for Modern Standard Arabic. The lexicon was manually annotated in triplicate by language professionals from three regions in the Arab world. 
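As an illustration of how three independent annotations could be combined into one level per lemma, here is a minimal sketch. The aggregation rule (the median of the three labels), the numbering of levels as 1 to 5, the function name `aggregate_level`, and the sample entries are all assumptions made for this example; the description above does not say how the project reconciled annotator disagreements.

```python
from statistics import median


def aggregate_level(levels):
    """Combine three annotator levels (assumed 1-5) by taking their median."""
    if len(levels) != 3:
        raise ValueError("expected exactly three annotations")
    if any(not 1 <= lv <= 5 for lv in levels):
        raise ValueError("levels must be between 1 and 5")
    # With three integer labels, the median is always one of the labels.
    return int(median(levels))


# Hypothetical annotations: the keys are placeholders, not real lexicon entries.
annotations = {
    "lemma_a": (1, 1, 2),
    "lemma_b": (3, 5, 4),
}

# One aggregated level per lemma.
lexicon = {lemma: aggregate_level(lv) for lemma, lv in annotations.items()}
```

The median is used here because it always returns one of the three labels and is not pulled toward a single outlying annotator; a majority vote or an adjudication step would be equally plausible choices.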

Download the SAMER Readability Lexicon here.

SAMER Readability Leveled Arabic Thesaurus

The SAMER Readability Leveled Arabic Thesaurus is an online thesaurus that integrates several Arabic natural language processing tools and databases to provide rich lexical and readability information for a given Arabic word.

Access the online thesaurus here.

SAMER Google Doc Add-On

We built the SAMER Google Doc Add-On to support the simplification of Arabic masterpieces. This tool helps human annotators identify the reading difficulty of an Arabic text by visualizing the word-level readability of the text in a Google Doc.
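The idea of word-level readability tagging can be sketched as a simple lexicon lookup. This is not the add-on's actual implementation: the function names `tag_levels` and `hardest_level`, the dictionary format, and the levels assigned to the sample words are invented for illustration, and the sketch assumes the text has already been tokenized and lemmatized.

```python
def tag_levels(lemmas, lexicon):
    """Pair each lemma with its readability level, or None if it is not listed."""
    return [(lemma, lexicon.get(lemma)) for lemma in lemmas]


def hardest_level(tagged):
    """Return the highest level among listed lemmas, or None if none are listed."""
    known = [level for _, level in tagged if level is not None]
    return max(known) if known else None


# Toy lexicon: the levels below are invented, not taken from the SAMER lexicon.
toy_lexicon = {"كتاب": 1, "قرأ": 1, "فلسفة": 4}

# Assumed to be the lemmas of a short, already-processed sentence.
lemmas = ["قرأ", "كتاب", "فلسفة", "غيرمعروف"]

tagged = tag_levels(lemmas, toy_lexicon)
difficulty = hardest_level(tagged)
```

A visualization layer could then color each word by its level and flag words missing from the lexicon, which is the kind of information a human simplifier needs when deciding what to rewrite.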

Install the SAMER Google Doc Add-On by following the instructions here.

SAMER Simplification Corpus

The SAMER Simplification Corpus is the first manually annotated Arabic parallel corpus for text simplification targeting school-age learners. The corpus comprises 159K words of text selected from 15 publicly available Arabic novels, most of which were published between 1865 and 1955. It includes readability level annotations at both the document and word levels, as well as two simplified parallel versions of each text, targeting learners at two different readability levels.

Download the SAMER Simplification Corpus here.