Arabic NLP
I became interested in Arabic NLP during my time at Essex, where it had been one of the research foci of the Computational Linguistics group since the '90s, thanks to the seminal work of Anne de Roeck and her students, and also of the Linguistics Department, e.g., through the work of Louisa Sadler.
Morphological Analysis of Arabic and IR
Between 2001 and 2004 I supervised Abdul Goweder's PhD work on the effect of different types of morphological analysis on Arabic IR (Goweder et al, 2004).
Arabic Summarization
Mahmoud El-Hadj, Udo Kruschwitz and Chris Fox worked on Arabic summarization, and Mahmoud also prepared the datasets for the Arabic multi-document summarization task at MULTILING-2011. Mahmoud, Maha Althobaiti and Ans Alghamdi prepared the datasets for MULTILING-2013.
Arabic Information Extraction
In the GALATEAS project, we developed Arabic NLP tools, in particular for language identification and Disambiguation to Wikipedia (D2W) (Lungley et al, 2013).
In her PhD work, Maha Althobaiti has been developing minimally supervised methods for NE extraction in Arabic, including semi-supervised methods (Althobaiti et al, 2013), distant learning methods (Althobaiti et al, 2014b) and methods to combine these using Bayesian Classifier Combination (Althobaiti et al, 2015).
Arabic Social Media
In a recently started collaboration with Minority Rights Group, we are using Arabic NLP techniques to monitor reports of human rights violations conveyed through social media, SMS and email.
Projects (in inverse chronological order)
- Using Text Analytics to support Human Rights Violations reports, a Knowledge Transfer Project funded by the Technology Board (2014-2017), is a collaboration with Minority Rights Group to develop tools that facilitate the collation and analysis of reports of human rights violations produced using social media and/or low-tech solutions, including SMS and email.
- GALATEAS was an EU-funded project to develop tools to support query log analysis. Our involvement in the project included the development of language identification tools, topic classification tools, and D2W tools in seven EU languages and Arabic.
Main publications
- Ayman Alhelbawy, Mark Lattimer, Udo Kruschwitz, Chris Fox and Massimo Poesio, submitted. An NLP-Powered Human Rights Monitoring Platform.
- Ayman Alhelbawy, Udo Kruschwitz and Massimo Poesio, 2016. Towards a corpus of violence acts in Arabic social media. Proc. of LREC.
- Maha Althobaiti, Udo Kruschwitz, and Massimo Poesio, 2015. Combining Minimally Supervised Methods for Arabic Named Entity Recognition. Transactions of the ACL. (pdf)
- Althobaiti, M., U. Kruschwitz and M. Poesio, 2014a. AraNLP: a Java-based Library for the Processing of Arabic Text. In Proceedings of LREC, Reykjavik, May.
- Althobaiti, M., U. Kruschwitz and M. Poesio, 2014b. Automatic Creation of Arabic Named Entity Annotated Corpus Using Wikipedia. Proc. of EACL, Student Session, Gothenburg, April.
- Althobaiti, M., U. Kruschwitz and M. Poesio, 2013. A Semi-Supervised Learning Approach to Arabic Named Entity Recognition. In Proceedings of RANLP, Hissar (Bulgaria), September.
- Lungley, D., M. Poesio, M. Trevisan, M. Althobaiti and V. Nguyen, 2013. GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service. In Proc. of ENRICH, Dublin, August.
- Abdul Goweder, Massimo Poesio, Anne de Roeck and Jeff Reynolds, 2004. Identifying broken plurals for Arabic Information Retrieval, Proc. of NEMLAR, Cairo, September.
- Abdul Goweder, Massimo Poesio, Anne de Roeck and Jeff Reynolds, 2004. Identifying broken plurals in unvowelized Arabic text, Proc. of EMNLP, Barcelona, July. (pdf)