Sarcasm Identification of Dravidian Languages Tamil & Malayalam

Task Description

Sarcasm detection is regarded as one of the hardest problems for sentiment analysis systems. A sarcastic statement communicates an opinion indirectly, with the intended meaning diverging from the literal one. There is growing demand for sarcasm and sentiment detection on social media text, which for Dravidian languages is largely code-mixed. Code-mixing is prevalent in multilingual communities, and code-mixed text is sometimes written in non-native scripts. Systems trained on monolingual data fail on code-mixed data because code-switching occurs at different linguistic levels within the text. This shared task presents a new gold standard corpus for sarcasm and sentiment detection of code-mixed text in Dravidian languages (Tamil-English and Malayalam-English).


The Tamil language is spoken by Tamil people in India and Sri Lanka, and by the Tamil diaspora around the world, with official recognition in India, Sri Lanka, and Singapore. Malayalam is a Dravidian language spoken predominantly by the people of Kerala, India. The Tamil script evolved from the Tamili script, Vatteluttu alphabet, and Chola-Pallava script. It has 12 vowels, 18 consonants, and 1 āytam (voiceless velar fricative). Minority languages such as Saurashtra, Badaga, Irula, and Paniya are also written in the Tamil script. The Malayalam script is alphasyllabic: it belongs to the abugida family of writing systems, which are partly alphabetic and partly syllable-based. However, social media users often mix in Roman script when typing because it is easier to input. Hence, the majority of the data available on social media for these under-resourced languages is code-mixed.
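
As an illustration of the script mixing described above, the following is a minimal sketch that counts how many letters of a comment fall in the Tamil (U+0B80 to U+0BFF), Malayalam (U+0D00 to U+0D7F), and basic Latin ranges. The function names are hypothetical and not part of the task; the check only detects mixed scripts, so a comment in fully romanized Tamil or Malayalam would still register as Latin only.

```python
# Unicode block ranges for the two native scripts.
TAMIL_RANGE = (0x0B80, 0x0BFF)
MALAYALAM_RANGE = (0x0D00, 0x0D7F)


def script_of(ch):
    """Return the script name for one character, or None for digits, spaces, etc."""
    cp = ord(ch)
    if TAMIL_RANGE[0] <= cp <= TAMIL_RANGE[1]:
        return "Tamil"
    if MALAYALAM_RANGE[0] <= cp <= MALAYALAM_RANGE[1]:
        return "Malayalam"
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return None


def script_profile(text):
    """Count characters per script in a comment (hypothetical helper)."""
    counts = {}
    for ch in text:
        script = script_of(ch)
        if script is not None:
            counts[script] = counts.get(script, 0) + 1
    return counts


def is_script_mixed(text):
    """True when the comment uses more than one of the tracked scripts."""
    return len(script_profile(text)) > 1


if __name__ == "__main__":
    print(script_profile("படம் super"))
    print(is_script_mixed("padam super"))
```

A profile like this could serve as a rough preprocessing signal, for example to route comments to transliteration before classification; whether that helps on this corpus is an open question.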


The goal of this task is to identify the sarcasm and sentiment polarity of code-mixed comments/posts in Tamil-English and Malayalam-English collected from social media. A comment/post may contain more than one sentence, but the average number of sentences per comment/post in the corpora is 1. Each comment/post is annotated with sentiment polarity at the comment/post level. The dataset is also class-imbalanced, reflecting real-world scenarios. Our proposal aims to encourage research that will reveal how sarcasm is expressed in code-mixed scenarios on social media.
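
One common way to account for the class imbalance mentioned above is to weight each class inversely to its frequency during training. The sketch below shows that heuristic in plain Python; the label names and counts are illustrative assumptions, not the actual labels or distribution of the released corpus, and the task does not prescribe this approach.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical label names and proportions, for illustration only.
    labels = ["Non-sarcastic"] * 8 + ["Sarcastic"] * 2
    print(class_weights(labels))
```

The resulting weights can be passed to a loss function or classifier that accepts per-class weights, so that errors on the rarer class count more.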


Use cases

Sarcasm is a type of verbal irony that expresses disgust or derision. It is regarded as one of the hardest problems for sentiment analysis systems because it communicates an opinion indirectly, with the intended meaning diverging from the literal one. Sentiment analysis has attracted great interest recently because business strategies can be improved with insights drawn from users' opinions about a product or subject of interest. As mentioned earlier, most comments on social media are code-mixed. The pervasiveness and user-friendliness of such platforms invite users from different strata of society to express their subjective opinions and true feelings about a topic without a filter. Hence, real sentiment about a subject can be extracted by analyzing code-mixed data. As sentiment analysis research has developed, researchers have begun to tackle the challenges that affect this task, such as sarcasm.

We will release a dataset of YouTube comments on movie trailers. This shared task can bring together researchers from academia and industry who want to detect sarcasm and sentiment in social media comments in order to predict how well a movie will be received by viewers.