TASK DESCRIPTION

Participants are challenged to predict, given the text of a tweet in Italian, the most likely emoji associated with it.

  • Participants are provided with 250,000 tweets in Italian to train their emoji prediction system: each tweet includes one and only one emoji, possibly repeated. We consider a set of 25 emojis in the ITAmoji task (see the Label set section below for more details)
  • Starting from the release date of the Test dataset (25,000 Italian tweets, 3rd of September), participants have two weeks to submit their predictions of the most likely emoji to associate with each tweet text. Optionally, in order to allow a more fine-grained evaluation of results, participants can submit for each tweet text the ordered ranking of the 25 emojis considered in ITAmoji, from the most likely to the least likely to be associated with the text of the tweet
  • Each submission of each participant will be evaluated with respect to the Macro F-score computed on the most likely emoji predicted for each tweet text (official task metric). In addition, participants who predict the ordered ranking of the 25 emojis (not only the most likely emoji) will also be evaluated with Accuracy@5/10/25 and Coverage Error (see the System evaluation section below for more details)

Below you can find more information on the Train and Test Datasets, the Label set and the System evaluation approach. Go to Data and tools to download the Training dataset as well as to access a list of tools and resources provided to the participants in the ITAmoji task.

Train and Test Datasets

The dataset for this task consists of 275,000 tweets in Italian, each one containing one and only one emoji from a set of 25 emojis (see below for more details): 250,000 tweets are provided to the participants as Train Dataset (go to Data and tools to get the data), while the remaining 25,000 tweets are included as part of the Test Dataset.

We randomly selected our set of 275,000 tweets in Italian from the following two collections:

  • Italian tweets issued from October 2015 to February 2018 in Italy, retrieved by means of the Twitter streaming API
  • Italian tweets issued by the followers of the Twitter accounts of the most read Italian newspapers

IMPORTANT: participants have to train their systems exclusively on the 250,000 tweets included in the training dataset. It is not allowed to gather additional tweets to extend the training dataset with more training examples.

Label set

We consider tweets including one and only one emoji from a set of 25 emojis. We associate a label (string) with each emoji of this set, e.g. red_heart, face_with_tears_of_joy, etc. The following table shows the label of each one of the 25 emojis considered in ITAmoji.

In the training dataset, the label of each tweet is specified by the value of the "label" field (more details on the structure of the training dataset are provided at Data and tools).

The order of emoji labels in this table is from the most to the least frequent in the training dataset.
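
As an illustration only, here is a minimal sketch of loading such a dataset in Python. It assumes the training file is distributed as one JSON object per line with a hypothetical "text" field next to the "label" field mentioned above; the actual field names and file layout are those documented at Data and tools.

    import json

    def load_tweets(path):
        # Assumes one JSON object per line with a "text" field (hypothetical
        # name) and the "label" field described above; adjust the keys to
        # match the format documented at Data and tools.
        texts, labels = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                texts.append(record["text"])
                labels.append(record["label"])
        return texts, labels

    # hypothetical file name, for illustration only
    train_texts, train_labels = load_tweets("itamoji_train.jsonl")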

System evaluation

The evaluation of the emoji prediction systems will be based on the classic Precision and Recall metrics over each emoji. In particular, the final ranking of the participating teams will be based on the Macro F-score computed with respect to the most likely emoji predicted given the text of each tweet of the test set. In this way we intend to encourage systems to perform well overall, which inherently means a better sensitivity to the use of emojis in general, rather than, for instance, overfitting a model to do well on the three or four most common emojis of the test data. The Macro F-score can be defined simply as the average of the individual label-wise F-scores.
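
For reference, the Macro F-score described above corresponds to scikit-learn's f1_score with average="macro". The sketch below assumes gold and predicted labels are given as lists of label strings and that the full list of the 25 ITAmoji labels is available:

    from sklearn.metrics import f1_score, precision_score, recall_score

    def macro_scores(y_true, y_pred, labels):
        # y_true / y_pred: lists of label strings (e.g. "red_heart");
        # labels: the 25 ITAmoji labels, so that labels never predicted
        # still contribute a zero F-score to the macro average.
        return {
            "macro_precision": precision_score(y_true, y_pred, labels=labels,
                                               average="macro", zero_division=0),
            "macro_recall": recall_score(y_true, y_pred, labels=labels,
                                         average="macro", zero_division=0),
            "macro_f1": f1_score(y_true, y_pred, labels=labels,
                                 average="macro", zero_division=0),
        }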

Often the semantics of different emojis can be similar: as a consequence, to compare the emoji prediction quality of distinct systems at a finer grain, when two systems fail to predict the right emoji for a tweet, it is important to distinguish between the system that ranks the right emoji among the most likely candidates for that tweet and the system that ranks the right emoji as unlikely to be associated with that tweet. To this purpose, we give ITAmoji task participants the possibility to submit as task results the ordered ranking of the 25 emojis considered in ITAmoji. Besides the Macro F-score, these systems will be evaluated by considering the rank of the right prediction, relying on Accuracy@5/10/25 and Coverage Error.
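
One possible way to compute these ranking-based metrics is sketched below. It assumes each system outputs either a full ordering of the 25 labels per tweet or a numeric score for each label (both assumptions, since the exact submission format is documented elsewhere); Coverage Error is available directly in scikit-learn.

    import numpy as np
    from sklearn.metrics import coverage_error

    def accuracy_at_k(y_true, rankings, k):
        # Fraction of tweets whose gold label appears among the top-k emojis
        # of the predicted ranking (rankings: lists of the 25 labels ordered
        # from most to least likely).
        hits = sum(gold in ranked[:k] for gold, ranked in zip(y_true, rankings))
        return hits / len(y_true)

    def coverage(y_true, scores, labels):
        # Coverage Error: on average, how far down the ranking one must go
        # to reach the gold label. scores: (n_tweets, 25) array of per-label
        # scores; labels: the 25 ITAmoji labels in the same column order.
        y_bin = np.zeros_like(scores)
        for i, gold in enumerate(y_true):
            y_bin[i, labels.index(gold)] = 1
        return coverage_error(y_bin, scores)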


With respect to baseline systems, we will consider the following three settings:

  • Baseline 1 (majority): this baseline predicts the most frequent class in the training set
  • Baseline 2 (weighted random): this baseline is designed to predict emojis randomly, while keeping the prediction class distribution similar to the emoji distribution in the training set
  • Baseline 3 (bag of words): finally, the most sophisticated baseline will consist of the following: each tweet text will be represented by means of a vector of the most informative tokens (punctuation included), selected using term frequency * inverse document frequency (tf-idf). An L2-regularized logistic regression classifier will then be employed to predict the most likely emoji to associate with that tweet text (see the sketch below).
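
The following is a minimal sketch of the three baselines with scikit-learn, assuming train_texts/train_labels and test_texts have been loaded as lists (e.g. as in the loading sketch above); the tf-idf feature selection and the regularization strength are not specified by the task, so the settings below are purely illustrative.

    from sklearn.dummy import DummyClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Baseline 1 (majority): always predict the most frequent training label.
    majority = DummyClassifier(strategy="most_frequent")

    # Baseline 2 (weighted random): random predictions drawn according to the
    # label distribution observed in the training set.
    weighted_random = DummyClassifier(strategy="stratified", random_state=0)

    # Baseline 3 (bag of words): tf-idf weighted token vectors (the token
    # pattern below keeps words and individual punctuation marks as tokens),
    # fed to an L2-regularized logistic regression; max_features is an
    # illustrative stand-in for the "most informative tokens" selection.
    bag_of_words = make_pipeline(
        TfidfVectorizer(token_pattern=r"\w+|[^\w\s]", max_features=50000),
        LogisticRegression(penalty="l2", max_iter=1000),
    )

    for name, model in [("majority", majority),
                        ("weighted random", weighted_random),
                        ("bag of words", bag_of_words)]:
        model.fit(train_texts, train_labels)
        predictions = model.predict(test_texts)
        print(name, predictions[:3])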