Arabic NLP

I became interested in Arabic NLP during my time at Essex, where it has been one of the research foci of the Computational Linguistics group since the '90s, thanks to the seminal work of Anne de Roeck and her students. It has also been a focus in the Linguistics Department, e.g., through the work of Louisa Sadler.

Morphological Analysis of Arabic and IR

Between 2001 and 2004 I supervised Abdul Goweder's PhD work on the effect of different types of morphological analysis on Arabic IR (Goweder et al., 2004).

Arabic Summarization

Mahmoud El-Hadj, Udo Kruschwitz and Chris Fox worked on Arabic summarization, and Mahmoud also prepared the datasets for the Arabic multi-document summarization task at MULTILING-2011. Mahmoud, Maha Althobaiti and Ans Alghamdi prepared the datasets for MULTILING-2013.

Arabic Information Extraction

In the GALATEAS project, we developed Arabic NLP tools, in particular for language identification and Disambiguation to Wikipedia (D2W) (Lungley et al., 2013).

In her PhD work, Maha Althobaiti has been developing minimally supervised methods for named entity (NE) extraction in Arabic, including semi-supervised methods (Althobaiti et al., 2013), distant learning methods (Althobaiti et al., 2014b) and methods to combine these using Bayesian Classifier Combination (Althobaiti et al., 2015).

Arabic Social Media

In a recently started collaboration with Minority Rights Group, we are using Arabic NLP techniques to monitor reports of human rights violations sent via social media, SMS and email.

Projects (in reverse chronological order)

  • Using Text Analytics to support Human Rights Violations reports, a Knowledge Transfer Project funded by the Technology Board (2014-2017), is a collaboration with Minority Rights Group to develop tools that facilitate the collation and analysis of reports of human rights violations submitted via social media and/or low-tech channels such as SMS messages and emails.
  • GALATEAS was an EU-funded project to develop tools to support query log analysis. Our involvement in the project included the development of language identification tools, topic classification tools, and D2W tools in seven EU languages and Arabic.

Main publications