Arabic Natural Language Processing

We have done research in the analysis of Arabic text, in particular, the Egyptian dialect. Two major lines of work are pursued:

  1. The analysis of Arabic lyrical work in order to understand the evolution of Arabic songs, particularly Egyptian ones, over the decades. We started with one of the most famous Arab singers of the twentieth century, Abdel Halim Hafez (عبد الحليم حافظ).

  2. Analysis of the Egyptian uprising on Jan 25th, 2011, from the perspective of foreign journalism.

Lyrics Analysis of the Arab Singer Abdel ElHalim Hafez

In this work we analyze the lyrics of one of the most famous and influential Arab artists of the twentieth century, عبد الحليم حافظ (Abdel Halim Hafez). Lyrics analysis provides insight into the artist’s career evolution and his interactions with the surrounding environment, including the social, political, and economic conditions. To perform such analysis we collected and compiled the lyrics of Abdel Halim, together with the necessary metadata, into an organized and structured form. The data are preprocessed by removing stop words and applying some normalization operations to the song text. We did not perform any lemmatization or stemming, as the original form of a token conveys much more information than its source word.

We performed a lexical analysis to study both the lexical density and the lexical diversity over the course of Abdel Halim's career. We also studied the most significant words, idioms, and terms used in the songs, using tools such as word clouds and more quantitative measures such as term frequency-inverse document frequency (TF-IDF). We divided Abdel Halim's career into sub-decades of five years each, and all analyses are done both yearly and, more coarsely, over these sub-decades.

We found a strong correlation between our statistical analysis and the socio-political status of Egypt and the Arab world during that time. This is especially relevant given that Abdel Halim is widely considered the son of the generation of the 1952 revolution in Egypt. The significance of Abdel Halim and his lyrics stems essentially from their being contemporaneous with radical changes in Egypt across all sectors, including political (support of liberation movements across the world, and the conflict with Israel) and socio-economic (especially the changing social class structure in Egypt). We also investigated the potential effectiveness of PoS (Part of Speech) tagging in genre analysis and classification.
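The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the work: the source does not list its exact normalization operations, so the rules here (removing diacritics and tatweel, unifying alef variants), the tiny `STOP_WORDS` set, and the function names are all assumptions. No stemming or lemmatization is applied, in line with the text.

```python
import re

# Assumed normalization rules, common in Arabic NLP; the source does not
# specify which operations were actually applied.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")    # tashkeel marks and tatweel
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza

# Tiny illustrative stop-word set (hypothetical); the real list is far larger.
STOP_WORDS = {"في", "من", "على", "يا"}


def normalize(text):
    """Strip diacritics and tatweel, then map alef variants to bare alef."""
    text = _DIACRITICS.sub("", text)
    return _ALEF_VARIANTS.sub("\u0627", text)


def preprocess(text):
    """Normalize, tokenize on runs of Arabic letters, and drop stop words."""
    tokens = re.findall(r"[\u0621-\u064A]+", normalize(text))
    return [t for t in tokens if t not in STOP_WORDS]
```

Tokens keep their surface form, so inflected variants of the same root remain distinct in all later counts.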

The contributions of the work can be summarized as follows:

  1. Creation of a dataset of the lyrics of the Egyptian and Arab artist Abdel Halim Hafez (عبد الحليم حافظ). The lyrics are associated with metadata, including the personnel who contributed to the creation of the corresponding songs. The dataset also includes the audio tracks of the songs.

  2. Creating a comprehensive dataset of Arabic stop words. We have gathered a very large collection of stop words from multiple diverse sources. Our list contains up to 11,686 Arabic stop words. This gives more credibility and significance to our results, analysis, and conclusions.

  3. Performing lexical analysis over the lyrics to draw conclusions about their lexical diversity and lexical density. This analysis is performed in a temporal fashion over the course of Abdel Halim's career. One important goal of such analysis, also of interest to songwriters, is to know whether there is a correlation between word frequency and hit songs. One result indicates that Abdel Halim was a kind of modernist in the sense that he favored short songs, with an average song length of 279. This is also manifested in the fact that the lyrical words themselves are relatively short, averaging about 4.7 characters. On the other hand, there was a decreasing trend in the lexical density of Abdel Halim’s songs over his career. Another conclusion is that lyrics length is positively correlated with Abdel Halim’s progression in his career: his longest songs were performed in the later stages, in the 1970s, with songs reaching about 1800 words.

  4. Studying the most significant words, terms, and idioms used in the lyrics of Abdel Halim, applying tools such as word clouds and numerical characterizations such as TF-IDF. This analysis is performed in a temporal fashion over the course of Abdel Halim's career. The terms with the largest TF-IDF turn out to be words that carry nationalistic and patriotic emotions. This contrasts with the common intuition, especially among newer generations, that Abdel Halim was just a romantic artist.

  5. Using visualizations, such as word clouds, bar charts, and density estimation plots, to aid the analysis and inference from the lyrics.

  6. Correlating the results of such analyses with both the subjective nature of Abdel Halim himself and the external sociopolitical status of Egypt and the Arab world across his career.

  7. Performing a preliminary investigation of the effectiveness of PoS (Part of Speech) tagging in genre analysis and classification. This is done by clustering the feature vectors extracted from the lyrics using a ‘bag of PoS tags’. The constituent songs of each cluster are analyzed in terms of their characteristics and similarities.
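The lexical measures in item 3 can be sketched as below. The source does not give its formulas, so this assumes the common definitions: lexical diversity as the type-token ratio, and lexical density as the share of tokens that are content words (approximated here as "not a stop word"). The helper `period_start` and its `first_year` argument are hypothetical names for grouping songs into five-year periods.

```python
def lexical_diversity(tokens):
    """Type-token ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def lexical_density(tokens, stop_words):
    """Fraction of tokens that are content words (assumed: not stop words)."""
    if not tokens:
        return 0.0
    content = sum(1 for t in tokens if t not in stop_words)
    return content / len(tokens)


def period_start(year, first_year, span=5):
    """First year of the span-year period that contains `year`."""
    return first_year + ((year - first_year) // span) * span
```

Note that lexical density has to be computed on the tokens before stop-word removal; otherwise it is trivially 1.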

Sample of songs with lyrics containing the subword ‘love’ (حب).

Sample of songs with lyrics containing the subword ‘nation’ (وطن).

Distribution of lyrics lengths.

Evolution of lyrics lengths over the decades.

Topmost words with respect to word frequency.

Word cloud based on word count in the whole corpus.

Top words at every decade.

Distribution of word lengths in Abdel Halim’s lyrics. Shown is the histogram overlaid with a fitted Gaussian distribution.

Histograms of word lengths for individual decades.

Lexical diversity over the years.

Lexical density over the years.

Rarest and most popular words with respect to TF-IDF.

TF-IDF statistics.

Words with highest TF-IDF per decade.
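For reference, TF-IDF scores of the kind summarized in these figures could be computed roughly as follows. This is a sketch under assumptions: it uses the plain variant (relative term frequency times the natural log of inverse document frequency, no smoothing), and the source does not say which variant was used or whether a "document" is a single song or all songs of a period. The function name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: dict mapping a name to a list of tokens.

    Returns a dict mapping each name to {term: tf-idf score}.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores
```

With this variant, a term that occurs in every document gets a score of zero, which is why very common words drop out and period-specific words rise to the top.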

Conclusion

In this work we have initiated the study of Arabic lyrics. In particular, we take as a case study the lyrics of Abdel Halim Hafez, one of the iconic singers of Egypt and the Arab world in the twentieth century. Lyrics analysis reveals much about the singer's inner emotions and belief system, as well as about the political and socioeconomic status of the society towards which the artistic work is directed. Abdel Halim exemplifies this well, as he was very much representative of the radical political and social changes occurring in Egypt, the whole of the Arab world, and even the whole of the third world, from the early 1950s till the end of the 1970s, a period that spanned almost his entire career. We have used a diverse set of tools for analyzing the lyrical work of Abdel Halim. These include simple word frequency analysis as well as more advanced measures such as lexical diversity, lexical density, and term frequency-inverse document frequency (TF-IDF). We have also used a multitude of visuals including histograms, bar plots, and word clouds. We have derived observations and conclusions about the career of Abdel Halim Hafez and how his songs reflected the societal changes occurring at different (critical) periods of time.

There are several directions to pursue in the future. First, we can use more analytical tools to study the lyrics of Abdel Halim, such as sentiment analysis and emotion analysis. We can also study the music and audio features in combination with the lyrics. This can be repeated for other Arab singers of older and recent times, and the outputs used to compare how Arabic music has evolved from the early 1900s to the present day in the twenty-first century. We plan to do emotion analysis over the lyrics of Abdel Halim (and other artists), in particular to investigate the evolution of particular emotions over the course of the artist’s career. Such emotions range from joy and happiness to sadness, despair, and abandonment, and from anger to strong patriotic feelings. From a review of the Arabic NLP literature, it seems that emotion analysis is best done through the use of lexicons. Several Arabic emotion lexicons are already publicly available. On the other hand, there are very few datasets (of limited size) for Arabic emotion, which makes the machine learning approach not very feasible or reliable in the short term.


The linguistic category model (LCM) is a tool for the systematic study and analysis of language. In our context it can provide more understanding of the language use in the lyrics. The idea of the LCM is to classify interpersonal (transitive) verbs that are used to describe psychological states or actions, and adjectives that are employed to characterize the uttering person. A new kind of data structure may be devised to meet application and processing requirements, such as word bags of common words that express or represent emotions in Arabic culture. This may also depend heavily on the Arabic dialect being analyzed.

References

  • Walid Gomaa. Lyrics analysis of the Arab singer Abdel ElHalim Hafez. ACM Trans. Asian Low-Resour. Lang. Inf. Process., June 2022. Just Accepted.

  • Walid Gomaa. Analysis of Arabic songs: Abdel ElHalim as a case study. In Aboul-Ella Hassanien, Kuo-Chi Chang, and Tang Mincong, editors, Advanced Machine Learning Technologies and Applications, pages 385–393, Cham, 2021. Springer International Publishing.


World Perception of the Egyptian Uprising

In order to infer how the world has perceived the unfolding of events in Egypt over the eight years starting in 2010, we take the Guardian newspaper as a sample study to extract valuable information about world viewpoints on the big events in Egypt during this period. We perform a sentiment analysis on all the articles in the ‘World’ section of the newspaper from the beginning of 2010 till the end of 2017, selected by just the keyword ‘Egypt’. We extracted unigram tokens from each article and used them for inference with three sentiment lexicons: afinn, nrc, and bing. The results show that the general trend is slightly negative over the whole selected period. Many conflicting feelings were prevalent during this period, such as positive, negative, trust, fear, anger, and anticipation. The results also show that the years 2011 and 2013, in which the world witnessed the two uprisings in Egypt, saw the peaks in both positive and negative emotions.

In this work, we perform text analysis, in particular sentiment analysis, with the Guardian newspaper as a sample study. This choice is based on the following reasons: (1) it is one of the most reputable newspapers in the world, (2) it provides easy access to its articles through an API, (3) it is an English-language newspaper, so it can be readily analyzed using available sentiment dictionaries, and (4) it is very representative of the Anglo-Saxon perception of the events in Egypt. Using the API provided by the Guardian, we extracted all the articles in the ‘World’ section of the newspaper from the beginning of 2010 till the end of 2017, based on just the keyword ‘Egypt’. We thus start one year before the political upheaval in order to study the preconditions that led to the events in 2011 and thereafter.
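The article selection described above might be expressed as a query like the one below. This is only a sketch of how such a request could be built, not the script used in the work. The endpoint and parameter names follow the Guardian Open Platform content API as we understand it and should be checked against its current documentation; the function name `build_query`, the page size, and the choice of `show-fields` are assumptions. The code only constructs the URL and does not contact the service.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Guardian content search API.
BASE_URL = "https://content.guardianapis.com/search"


def build_query(api_key, page=1):
    """Build a search URL for 'Egypt' in the World section, 2010-2017."""
    params = {
        "q": "Egypt",                # keyword used for selection
        "section": "world",          # the 'World' section
        "from-date": "2010-01-01",
        "to-date": "2017-12-31",
        "show-fields": "bodyText",   # assumed: request the plain article text
        "page-size": 50,             # assumed page size
        "page": page,
        "api-key": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

Results are paginated, so a full extraction would call this for successive `page` values until the reported number of pages is reached.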

Monthly histogram of the count of the keyword ‘Egypt’ over the period 2010–2017.

The histogram is overlaid with a red curve that indicates the general trend in how often the keyword ‘Egypt’ has been in the international focus since the beginning of 2010. The grey envelope around the trend curve represents the 95% confidence level. There is a sharp increase from the beginning of 2010, after which the level remains stable till mid-2014, followed by a small decline of focus. There are four sharp peaks (frequency above 100). Two of them correspond to the uprisings in Jan–Feb 2011 and Jun–Jul 2013. The third peak is explained by the streak of events towards the end of 2011, including the Maspero demonstrations, clashes in Tahrir Square, parliamentary elections, and the burning of the Institut d’Égypte. The last peak occurred in November 2015, reflecting the explosion of the Russian Metrojet Flight 9268. Some major events in Egypt, such as the Sinai mosque attack in November 2017, did not get much attention, suggesting that only events with international and/or regional consequences receive wide international publicity.

Average monthly and yearly afinn sentiment analysis over the whole period 2010–2017. The red curve is the trend with 95% confidence envelope.

As can be seen from the trend curve (in red), there was a steady decrease in positivity that reached its trough in 2013, followed by a slow reversal since then. Note that there was a positive peak around September–October 2015; with the downing of the Russian airliner, however, this was followed by a negative decline towards the end of 2015 and the beginning of 2016. It can also be seen that the overall trend is smooth, meaning that sentiments have, to a large extent, been conservative: there was no sustained sharp optimism or pessimism.
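An afinn-style monthly average of the kind plotted here can be sketched as follows. The afinn lexicon assigns each word an integer valence; the four entries in `TOY_AFINN` are an illustrative stand-in, not the real lexicon. The aggregation (mean valence of matched unigrams per article, then mean over the articles of a month) is an assumption, since the source does not spell out how the averages were formed.

```python
import re
from collections import defaultdict

# Hypothetical afinn-style excerpt: word -> integer valence.
TOY_AFINN = {"hope": 2, "peace": 2, "violence": -3, "fear": -2}


def article_score(text, lexicon=TOY_AFINN):
    """Mean valence of the unigrams found in the lexicon (0.0 if none)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


def monthly_average(articles, lexicon=TOY_AFINN):
    """articles: iterable of ("YYYY-MM", text) pairs -> {month: mean score}."""
    by_month = defaultdict(list)
    for month, text in articles:
        by_month[month].append(article_score(text, lexicon))
    return {m: sum(s) / len(s) for m, s in by_month.items()}
```

Yearly averages follow the same pattern with a `"YYYY"` key instead of `"YYYY-MM"`.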

Yearly sentiment analysis based on the nrc 10 emotions.

It is evident that strong emotions are not prevalent; these include surprise, sadness, joy, and disgust. The most prevalent emotions are positive, negative, trust, and fear, followed by anger and anticipation. These results are in harmony with the afinn-based analysis in that the general mood is conservative overall, with mild transitions between positive and negative sentiments. It can also be seen that the year 2013 witnessed the peaks in both positive and negative emotions, which can be described as pre- and post-uprising in June–July of that year, respectively. Alongside these are peaks of trust and fear in the same year, likewise marking the post- and pre-uprising periods in June–July. The same four emotions also stand out before and after the uprising in January 2011, though on a milder scale than in 2013. It is of interest that all emotions are at their lowest in 2010 and 2017, the years that respectively predate and postdate the political upheavals in between.
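Counting nrc-style categories over unigrams can be sketched as below. The nrc lexicon maps a word to zero or more of ten categories (eight emotions plus positive and negative). `TOY_NRC` is a hypothetical excerpt: only the link between ‘brotherhood’ and ‘trust’ is taken from this text, and the other associations are made up for illustration.

```python
from collections import Counter

# Hypothetical nrc-style excerpt: word -> set of associated categories.
TOY_NRC = {
    "brotherhood": {"positive", "trust"},
    "violence": {"negative", "anger", "fear"},
}


def emotion_counts(tokens, lexicon=TOY_NRC):
    """Count how often each category is triggered by the given unigrams."""
    counts = Counter()
    for token in tokens:
        counts.update(lexicon.get(token, ()))  # unknown words add nothing
    return counts
```

Because the lookup is per unigram, a word such as ‘brotherhood’ contributes to ‘trust’ regardless of the political context in which it appears.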

(a) Sentiment analysis based on the nrc 10 emotions over the whole period 2010–2017 and (b) a word cloud, based on the nrc lexicon, over the whole period 2010–2017.

Sentiment word cloud based on the nrc lexicon: (a) during Jan–Feb 2011, (b) during Jan–Mar 2011. Notice that the major words include ‘government’, ‘opposition’, and ‘revolution’, followed to a lesser extent by ‘president’, ‘military’, ‘police’, and ‘corruption’. Notice that ‘brotherhood’ is a minor word, and it is wrongly associated with the ‘trust’ sentiment because its semantic interpretation in the nrc lexicon (fraternity), as a unigram, does not have any political association. Including March in the word cloud does not change much; this can be attributed to the calming of the major events, in addition to less coverage from the international media as events settled.

Emotional word cloud based on the nrc lexicon: (a) during Jun–Jul 2013, (b) during Jun–Aug 2013. This figure reflects a different reality, consistent with the state of affairs at that time. In this case ‘brotherhood’ is the main word, as the Muslim Brotherhood was in power at that time, whereas it was absent, even in terms of participation, from the first uprising in Jan–Mar 2011. Subsequent main words include ‘military’, ‘coup’, and ‘revolution’, which might indicate some confusion in the Guardian’s perception, and maybe that of the western media in general, of the events in Egypt during Jun–Jul 2013. Other expected main words include ‘opposition’, ‘government’, ‘violence’, and ‘president’.

World Perception of the Latest Events in Egypt Based on Sentiment Analysis of the Guardian’s Related Articles. (a) Radar chart for the years 2011, 2013, 2017, and over the whole period 2010–2017. Sentiments are based on the nrc lexicon. (b) Major sensational words along with their polarities (|polarity| ≥ 1000) over the whole period 2010–2017.

Polarity score, based on the bing lexicon, both on a yearly basis and over the whole period 2010–2017. Generally, the polarity tends towards the neutral position (≈0). For half of the years the polarity is almost constant. The remaining years, including 2011, 2012, 2013, 2014, and 2015, witnessed sharp changes. The years 2011 and 2013 witnessed the two major uprisings; towards the end of each of these years the polarity moved upwards towards neutrality, and even positivity in 2013. Mid-2012 witnessed the presidential decree calling into session the dissolved parliament and the rejection of that decision by Egypt’s Supreme Constitutional Court, followed by the resignation of the Minister of Defense and the Chief of Staff and by many protests and demonstrations; then, in November of that year, came the presidential declaration that in effect immunized any presidential action from legal challenge, which caused fury in the streets. All this uncertainty, instability, and confusion induced a sharp negative sentiment towards the end of 2012. From fig. (b), it is evident that the most negative polarization occurred during 2013, after which recovery followed until it reached a state similar to that of 2010.
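A bing-style polarity score can be sketched as the number of positive words minus the number of negative words. This definition is an assumption (it is the usual convention for a binary lexicon; the source does not state its formula), and `TOY_BING` is a hypothetical four-word stand-in for the real lexicon.

```python
# Hypothetical bing-style excerpt: word -> "positive" or "negative".
TOY_BING = {
    "stable": "positive",
    "free": "positive",
    "chaos": "negative",
    "fury": "negative",
}


def polarity(tokens, lexicon=TOY_BING):
    """Positive-word count minus negative-word count over the unigrams."""
    pos = sum(1 for t in tokens if lexicon.get(t) == "positive")
    neg = sum(1 for t in tokens if lexicon.get(t) == "negative")
    return pos - neg
```

A score near zero then means that positive and negative words roughly balance, which matches the reading of neutrality used in the text.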

Conclusion

During the period 2010–2017, many critical events occurred in Egypt: two main uprisings in 2011 and 2013, two presidential elections, and some other serious incidents. The world's newspapers reflected the world's anxiety about these events and their influence on the Middle East region. We studied the Egypt-related articles in the Guardian newspaper in this period using lexicon-based sentiment analysis with three different lexicons. Based on our study, we found that the general emotion was slightly negative during this period. Many conflicting feelings were prevalent (such as positive, negative, trust, and fear), especially in the two years of uprisings, which may reflect the political polarization in Egypt during this period.

References

  • Walid Gomaa and Reda Elbasiony. World perception of the latest events in Egypt based on sentiment analysis of the Guardian’s related articles. In Aboul Ella Hassanien, Ahmad Taher Azar, Tarek Gaber, Roheet Bhatnagar, and Mohamed F. Tolba, editors, The International Conference on Advanced Machine Learning Technologies and Applications (AMLTA2019), pages 908–917, Cham, 2020. Springer International Publishing.