Resources

Corpora

1- Wikipedia Derived Corpora (WDC)

The WDC dataset contains around 6 million tokens collected and annotated automatically from Arabic Wikipedia. We exploit Wikipedia's features and structure to build the dataset automatically: each Wikipedia link is mapped to the NE type of its target article, which produces the NE annotation. Other Wikipedia features, namely redirects, anchor texts, and inter-language links, are used to tag additional NEs that appear without links in Wikipedia text.
The WDC dataset adheres to the CoNLL 2003 annotation guidelines and uses the CoNLL NE types: Person, Location, Organisation, and Miscellaneous.
The annotation style of the WDC dataset follows the CoNLL format, where each token and its tag are placed together in the same file in the form:
<token> \s <tag>
The NE boundary is specified using the BIO representation scheme, where:
B- indicates the beginning of an NE,
I- indicates the continuation (inside) of an NE,
and O indicates that the word is not part of an NE.
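
The format described above can be read back into entity spans with a few lines of code. The following is a minimal sketch, not part of the WDC distribution. It assumes one whitespace-separated token/tag pair per line, blank lines between sentences, and tag labels such as B-PER and I-LOC; the exact label strings used in the WDC files are an assumption here, as are the function name `extract_entities` and the made-up English sample.

```python
from typing import Iterable, List, Tuple


def extract_entities(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from '<token> <tag>' lines.

    Assumes BIO tags of the form B-TYPE / I-TYPE / O (label names are
    illustrative, not confirmed against the WDC files).
    """
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    etype = None

    def flush() -> None:
        # Close the entity currently being built, if there is one.
        nonlocal tokens, etype
        if tokens:
            entities.append((" ".join(tokens), etype))
        tokens, etype = [], None

    for line in lines:
        parts = line.rsplit(None, 1)  # split off the last field as the tag
        if len(parts) < 2:
            flush()  # a blank or malformed line ends any open entity
            continue
        token, tag = parts
        if tag.startswith("B-"):
            flush()  # B- always starts a new entity
            tokens, etype = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == etype:
            tokens.append(token)  # I- continues the open entity of the same type
        else:
            flush()  # O, or an I- tag with no matching open entity

    flush()
    return entities


# Made-up sample in the assumed format (not taken from the corpus).
sample = [
    "Barack B-PER",
    "Obama I-PER",
    "visited O",
    "Cairo B-LOC",
    ". O",
]

if __name__ == "__main__":
    for text, label in extract_entities(sample):
        print(label, text)
```

In this sketch an I- tag that does not follow a matching B- tag is ignored rather than promoted to a new entity; a stricter or more lenient policy may suit other uses.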

Please cite our paper in any published work using this resource:
@inproceedings{Althobaiti14Automatic,
  title={{Automatic Creation of Arabic Named Entity Annotated Corpus Using Wikipedia}},
  author={M. Althobaiti and U. Kruschwitz and M. Poesio},
  booktitle={Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
  year={2014},
  pages={106--115},
  address = {Gothenburg}
}

Download WDC corpus


Software

1- AraNLP

The AraNLP library is a Java-based toolkit for processing Arabic text. It supports the most important preprocessing steps: diacritic and punctuation removal, tokenization, sentence segmentation, part-of-speech tagging, root stemming, light stemming, and word segmentation. These tools are usually required to prepare text for more advanced NLP tasks.
The goal of AraNLP is to gather most of the vital Arabic text preprocessing tools into one easily accessible library. To that end, we added tools that were missing and included existing algorithmic resources.
AraNLP has already been used to preprocess Arabic text in many experiments, and it processed the corpora successfully.
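
To give a concrete idea of what the simplest of these preprocessing steps involve, here is a short illustrative sketch. It is written in plain Python and does not use or reproduce AraNLP's Java API; the function names are hypothetical, and the choice of the Unicode range U+064B to U+0652 as "diacritics" is an assumption that may differ from what AraNLP removes.

```python
import re
import unicodedata
from typing import List

# Arabic tashkeel marks from fathatan (U+064B) to sukun (U+0652).
# Assumption: other marks (e.g. tatweel, superscript alef) are left alone.
DIACRITICS = re.compile("[\u064B-\u0652]")


def remove_diacritics(text: str) -> str:
    """Delete the short-vowel and related marks listed above."""
    return DIACRITICS.sub("", text)


def remove_punctuation(text: str) -> str:
    """Drop every character whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def tokenize(text: str) -> List[str]:
    """Naive whitespace tokenization; real tokenizers handle far more."""
    return text.split()


if __name__ == "__main__":
    # A vocalized word followed by an Arabic comma (U+060C) and a second word,
    # written with escapes to avoid rendering issues.
    raw = "\u0643\u064E\u062A\u064E\u0628\u064E\u060C \u0648\u0644\u062F"
    print(tokenize(remove_punctuation(remove_diacritics(raw))))
```

Steps such as part-of-speech tagging and stemming depend on trained models or linguistic rules and cannot be reduced to a few lines like this.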

Please cite our paper in any published work using this resource:
@inproceedings{Althobaiti14AraNLP,
  title={{AraNLP: a Java-Based Library for the Processing of Arabic Text}},
  author={M. Althobaiti and U. Kruschwitz and M. Poesio},
  booktitle={Proceedings of the 9th Language Resources and Evaluation Conference (LREC)},
  year={2014},
  address = {Reykjavik}
}

Download AraNLP