Arabic corpora

Web-based (Searchable) corpora

The Quranic Arabic Corpus

An annotated linguistic resource which shows the Arabic grammar, syntax and morphology for each word in the Holy Quran. The corpus provides three levels of analysis: morphological annotation, a syntactic treebank and a semantic ontology.


The Arabic Corpus provides information on word frequency and allows users to find larger structures and grammatical patterns. Words can be searched in Arabic or Latin script. The website provides detailed search instructions. Registration is recommended.

التونسية: Tunisian Arabic Corpus

A free online corpus of Tunisian Arabic. It contains 2,000 texts comprising 818,310 words, classified into 17 categories such as blogs, phone conversations, Internet forums, and jokes. Researchers can search for a word in one of three modes: exact, stem, or regex (transliteration).

The Arabic Learner Corpus

It provides numerous written and spoken samples produced by L2 and heritage learners of Arabic in Saudi Arabia. These samples were transcribed into a database with cross-referenced categories according to level (beginning, intermediate, advanced), learner (L2 vs. heritage), and genre (description, narration, instruction).

WebCorp Live

It contains a large collection of Web texts from which examples of real language use can be extracted. It relies on search engines (e.g., Google) to obtain a list of URLs and then extracts concordance lines from each of these URLs. It supports many languages, one of which is Arabic.
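
The idea of a concordance line can be illustrated with a minimal keyword-in-context (KWIC) sketch. This is not WebCorp's implementation; the function name `kwic`, the `width` parameter, and the sample sentence are assumptions made for illustration, and the text is assumed to be already downloaded and whitespace-tokenizable.

```python
def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each exact match.

    Hypothetical helper: tokenizes on whitespace only, which is a
    simplification for Arabic (clitics and punctuation are not handled).
    """
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Made-up sample sentence, not taken from any corpus listed here.
sample_text = "ذهب الولد إلى المدرسة ثم عاد الولد إلى البيت"
concordance = kwic(sample_text, "الولد", width=2)
for left, node, right in concordance:
    print(left, "[", node, "]", right)
```

A real concordancer would normally add tokenization, normalization, and sorting of the lines by left or right context.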

Leeds Arabic Internet Corpus: Querying Internet corpora

Users can use the Arabic interface to make concordances sorted by document, frequency, lemma, or word, and then by left or right context. This corpus can also be used to retrieve collocates by specifying the number of words to consider before and/or after the search term.
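
Window-based collocate retrieval of this kind can be sketched as follows. This is a rough illustration only, not the Leeds interface's actual method; the function `collocates`, its `left` and `right` parameters, and the sample tokens are hypothetical.

```python
from collections import Counter


def collocates(tokens, node, left=2, right=2):
    """Count words found within `left` tokens before and `right` tokens
    after each occurrence of `node` (hypothetical helper)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


# Made-up sample, tokenized on whitespace for simplicity.
sample = "قرأ الطالب الكتاب ثم قرأ الطالب المقال".split()
result = collocates(sample, "الطالب", left=1, right=1)
print(result.most_common())
```

Raw counts like these are usually only a first step; corpus tools typically rank collocates with an association measure rather than frequency alone.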

International Corpus of Arabic

The ICA covers about 100 million Arabic words, from 2006 to 2013, extracted from numerous sources (newspapers, Web articles, books, etc.) and numerous genres (literature, politics, sciences, etc.). It relies on the "Tim Buckwalter" analyzer for morphological analysis, which lists information such as prefix(es), suffix(es), word class, stem, lemma, root, and stem pattern, as well as number, gender, and definiteness, according to the different contexts of the words within the corpus.

Sketch Engine

Sketch Engine is a commercial multilingual corpus tool. In addition to searching in Arabic, users can build and manage their own corpora and then extract concordances, word lists, collocates, and keywords.


Qurany

The Qurany corpus is augmented with an ontology, or index of key concepts, taken from a recognized expert source. The expert knowledge used in annotating the Quran corpus is obtained from 'Mushaf Al Tajweed', which contains a comprehensive hierarchical index or ontology of nearly 1200 concepts in the Quran. Qurany is the only tool that allows users to search the Quran corpus for abstract concepts via an ontology browser. Scholars can use the Qurany ontology browser to identify a precise concept and find the verses which allude to this concept, with higher precision.

Quran Concordance

A tool for making concordances of Quranic texts. However, it takes a long time to show results.

Quranic Word Co-occurrence

A tool for extracting Quranic collocates that co-occur with a given word. However, it takes a long time to show results.

N-gram Search

A tool for retrieving collocates from the Quran.

Arabic Concordancer المنقب العربي

A corpus built at the International Islamic University of Malaysia. It is claimed to contain 14 million Arabic words harvested from various online academic resources, e.g., online theses, journal articles, and conference papers.

Maskouk مسكوك

A free online tool for retrieving Arabic collocates.

Islamic Law Materialized

A corpus of primary sources for Islamic law and legal practice in pre-modern Muslim societies. This online presentation is the first ever collection of scattered editions of legal documents from the 2nd/8th to the 9th/15th century, often with improved readings compared to earlier print versions. Documents are presented with the Arabic text in modern spelling and with full bibliographical data.

A Digital Corpus for Graeco-Arabic Studies

It assembles a wide range of Greek texts and their Arabic counterparts. It also includes a number of Arabic commentaries and important secondary sources. The texts in the corpus can be consulted individually or side by side with their translation. The majority of texts can also be downloaded for further analysis.

APD: The Arabic Papyrology Database

It contains editions of Arabic documents written on different materials such as papyrus, parchment, or paper. These editions are an often untapped treasure for almost every aspect of Islamic history up to the 16th century A.D. The database comprises a total of 11,772 documents. It allows users to search for documents and read their texts in different layers. The 'Search' tool accesses single words or combinations of words, which is ideal for investigating linguistic peculiarities. The 'Lexicon' tool gives direct access to the entire lexicon, including Greek, Coptic, and other words. Each document is also provided with its metadata, including its place and date of origin and its genre, for instance a contract of lease or a petition.

King Abdulaziz City for Science and Technology (KACST) Arabic Corpus

It is a freely available Arabic corpus that supports various research purposes, including Natural Language Processing. It contains more than one billion words. It covers written texts of Classical Arabic and Modern Standard Arabic only, from the pre-Islamic era until the launch of the corpus.

Textual corpora (Text Files)


Khaleej-2004 corpus

The texts here have been extracted from thousands of articles downloaded from Akhbar Al Khaleej, an online newspaper. The corpus contains more than 5000 articles, which correspond to nearly 3 million words. Punctuation has been deleted on purpose. For more information, check the works based on the Khaleej-2004 corpus.

Watan-2004 corpus

The Watan-2004 corpus contains about 20,000 articles covering the following six topics (categories): Culture, Religion, Economy, Local News, International News, and Sports. In this corpus, punctuation has been omitted intentionally in order to make it useful for language modeling.

Corpus of Contemporary Arabic (CCA)

Texts in this corpus are mainly derived from websites. The spoken files, which are very small, were obtained from Radio Qatar. There are 15 genres/categories for the written texts and 3 genres/categories for the spoken ones.

King Saud University Corpus of Classical Arabic

KSUCCA texts are classified into six folders representing the main genres of the corpus: Religion, Linguistics, Literature, Science, Sociology, and Biography. Its compilers state that it comprises 50,000,000 words.


KALIMAT

KALIMAT is an Arabic natural language resource whose articles fall into six categories: culture, economy, local-news, international-news, religion, and sports. It contains 4,000,000 words (20,291 articles).

The UN Parallel Corpus

It is composed of official records and other parliamentary documents of the United Nations that are in the public domain. These documents are mostly available in the six official languages of the United Nations. The current version of the corpus contains content that was produced and manually translated between 1990 and 2014, including sentence-level alignments.

Corpus of Contemporary Arabic (CCA) (Written Corpora)

It is mainly derived from website texts. It includes 842,684 words in 415 texts, covering some of the categories identified by language teachers and language engineers. Some spoken files obtained from Radio Qatar are also included.

Arabic Corpus

The Arabic Corpus, compiled by Dr. Mourad Abbas, is freely available. It contains 5,690 documents from Khaleej-2004 divided into 4 topics (categories) and 20,291 documents from Watan-2004 organized into 6 topics (categories).

Ajdir Corpora

A freely available corpus of 113 million words (800 MB) drawn from journals.

Tashkeela: Arabic Vocalized text corpus

It contains vocalized Arabic text in plain-text format: 76 million words extracted from the Al-Shamela library.
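
One common way to use vocalized text is to pair each vocalized string with its unvocalized form. The following is a minimal sketch under stated assumptions: it removes only the basic diacritics in the Unicode range U+064B to U+0652 (tanwin, short vowels, shadda, sukun), and the helper name `strip_diacritics` is hypothetical rather than part of Tashkeela.

```python
import re

# Basic Arabic diacritics: fathatan (U+064B) through sukun (U+0652).
# Other marks (e.g., superscript alef U+0670) are deliberately left alone.
HARAKAT = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Return `text` with the basic Arabic diacritics removed."""
    return HARAKAT.sub("", text)


# kaf + fatha, ta + fatha, ba + fatha, written with escapes for clarity.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
unvocalized = strip_diacritics(vocalized)
print(vocalized, "->", unvocalized)
```

Which marks count as diacritics depends on the task, so the character range may need adjusting for a given corpus.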

Open Source Arabic Corpora (OSAC)

A freely available corpus consisting of 4,102,134 tokens derived from the BBC Arabic and CNN Arabic websites.

Arabic Words Corpora

It consists of 1.5 million words (5.4 MB, zipped).

KHATT: Handwritten Arabic Text (paid)

It was developed by King Fahd University of Petroleum & Minerals, Technical University of Dortmund and Braunschweig University of Technology. It is comprised of scanned Arabic handwriting from 1,000 distinct male and female writers representing diverse countries, age groups, handedness, and education levels. Participants produced text on a topic of their choice in an unrestricted style. KHATT was designed to promote research in areas such as text recognition and writer identification.

OPUS: the open parallel corpus

It is a growing collection of translated texts from the web. OPUS is based on open source products, and the corpus is also delivered as an open content package.

QED Corpus (formerly QCRI AMARA Corpus)

It is an open multilingual (parallel) collection of subtitles for educational videos and lectures, collaboratively transcribed and translated over the AMARA web-based platform. The current release of the corpus, v1.4, contains 20 languages distributed over 44,620 files.

Arabic Gigaword (paid)

A collection of newswire text data produced by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. Four distinct sources of Arabic newswire are represented: Agence France Presse (afa), Al Hayat News Agency (alh), Al Nahar News Agency (ann), and Xinhua News Agency (xin). This corpus is available for a fee; however, a scholarship program provides eligible students with no-cost access to LDC data.

WIT3: Web Inventory of Transcribed and Translated Talks

It is a ready-to-use version, for research purposes, of the multilingual (parallel) transcriptions of TED talks.

Multi-Modal Arabic Corpus

Arabic Speech Corpus

This speech corpus was developed as part of PhD work carried out by Nawar Halabi at the University of Southampton. The corpus was recorded in south Levantine Arabic (Damascene accent) in a professional studio. Speech synthesized using this corpus has produced a high-quality, natural voice.