Data and tools

Below you will find how to obtain the Train and Test Datasets and a description of their structure, together with the submission format for your final emoji predictions over the Test set.

> Train Dataset (released on 29th May 2018)

The Train Dataset includes 250,000 tweets in Italian to train emoji prediction systems. See the Train and Test Datasets and the Label set section of the Task Description to obtain more information on this dataset.

You can download the train dataset as a password-protected ZIP archive from the following link:

https://drive.google.com/file/d/1VSN9H3vZVTIr0qOXYPmiBp6FHsvovRlb/view?usp=sharing

The password for the ZIP archive is provided to ITAmoji Task participants on request: send an email with subject "ITAmoji train dataset password" to itamojievalita@gmail.com

The Train Dataset is a single text file (UTF-8) with one tweet per line in the following JSON Object format:

{"tid":"TWEET_ID","uid":"USER_ID","created_at":"CREATION_DATE","text_no_emoji":"TWEET_TEXT_WITHOUT_EMOJI","label":"EMOJI_LABEL"}

Systems should use the TWEET_TEXT_WITHOUT_EMOJI to predict the EMOJI_LABEL.
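As a minimal sketch of how a file in this format could be read, the following Python snippet parses one JSON object per line and counts the labels. The helper names (read_tweets, label_counts) and the sample values, including the date strings, are invented for illustration and are not taken from the dataset.

```python
import json
from collections import Counter


def read_tweets(lines):
    """Parse an iterable of JSON-object lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def label_counts(tweets):
    """Count how many tweets carry each emoji label."""
    return Counter(tweet["label"] for tweet in tweets)


# Invented sample lines that follow the documented field names.
sample = [
    '{"tid":"1","uid":"10","created_at":"2018-01-01","text_no_emoji":"Buongiorno a tutti","label":"red_heart"}',
    '{"tid":"2","uid":"11","created_at":"2018-01-02","text_no_emoji":"Che ridere","label":"face_with_tears_of_joy"}',
]
tweets = read_tweets(sample)
counts = label_counts(tweets)

# With the real file (the file name here is hypothetical):
# with open("train.json", encoding="utf-8") as f:
#     tweets = read_tweets(f)
```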

IMPORTANT: participants must train their systems exclusively on the 250,000 tweets included in the Train Dataset. Gathering additional tweets to extend the Train Dataset with more training examples is not allowed.

The following table shows the frequency (number of tweets) for each emoji in the Train Dataset:

> Test Dataset (released on 3rd September 2018)

The Test Dataset includes 25,000 tweets in Italian that will be used to evaluate the emoji prediction systems developed by ITAmoji participants. See the Train and Test Datasets and the Label set section of the Task Description to obtain more information on this dataset.

The Test Dataset can be downloaded as a password-protected ZIP archive from the following link:

https://drive.google.com/file/d/1Oz78NjhBi1C92lqP1zLfTW7Dkgtmm5oA/view?usp=sharing

The password for the Test Dataset ZIP archive is the same one used for the Train Dataset ZIP archive, so ITAmoji participants should already know it and be able to access the Test Dataset. In any case, the password can be obtained by sending an email with subject "ITAmoji test dataset password" to itamojievalita@gmail.com

The Test Dataset is a single text file (UTF-8) with one tweet per line in the following JSON Object format:

{"tid":"ITAMOJI_TWEET_IDENTIFIER","uid":"USER_ID","created_at":"CREATION_DATE","text_no_emoji":"TWEET_TEXT_WITHOUT_EMOJI"}

Participants can submit up to three system runs. Each submission should be sent by email to itamojievalita@gmail.com with the subject "Team: TEAM_NAME Run: RUN_IDENTIFIER", attaching the emoji prediction results file as a UTF-8 encoded text file in the format described below, in the section "Format of the emoji prediction results file".

The deadline for submitting your runs is September 9th 2018 AoE.

Format of the emoji prediction results file

The emoji prediction results file should be a UTF-8 encoded text file in which each line describes the emoji(s) predicted for a specific tweet in the Test Dataset.

Each line of the emoji prediction results file should have the following format:

{"tid":"ITAMOJI_TWEET_IDENTIFIER", "label_1":"EMOJI_LABEL_1", "label_2":"EMOJI_LABEL_2", "label_3":"EMOJI_LABEL_3", ...  "label_25":"EMOJI_LABEL_25"}

where:

"tid" (MANDATORY) is the ITAMOJI_TWEET_IDENTIFIER of the tweet in the Test Dataset
"label_1" (MANDATORY) is the label of the emoji most likely to be associated with the test set tweet (e.g. red_heart, face_with_tears_of_joy, etc.)
"label_2" ... "label_25" (OPTIONAL) are the labels of the other 24 emojis, ordered from the most likely to the least likely to be associated with the test set tweet (e.g. red_heart, face_with_tears_of_joy, etc.)

Please verify that the emoji prediction results file includes exactly one row with emoji predictions for each of the 25,000 tweets of the Test Dataset. If two lines provide emoji predictions for the same tweet (i.e. have the same "tid"), only the first one will be considered.
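The line format and the one-row-per-tweet check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: format_prediction and check_results are hypothetical helper names, the tweet identifiers are invented, and the organizers' own validation may differ.

```python
import json


def format_prediction(tid, ranked_labels):
    """Build one results line: "tid" plus "label_1" ... "label_N", most likely first."""
    if not ranked_labels:
        raise ValueError("label_1 is mandatory")
    if len(ranked_labels) > 25:
        raise ValueError("at most 25 labels are allowed")
    record = {"tid": tid}
    for rank, label in enumerate(ranked_labels, start=1):
        record[f"label_{rank}"] = label
    return json.dumps(record, ensure_ascii=False)


def check_results(result_lines, test_tids):
    """Return (missing, duplicated, unknown) tweet identifiers as sorted lists."""
    seen = set()
    duplicated = set()
    for line in result_lines:
        if not line.strip():
            continue
        tid = json.loads(line)["tid"]
        if tid in seen:
            duplicated.add(tid)
        seen.add(tid)
    expected = set(test_tids)
    return sorted(expected - seen), sorted(duplicated), sorted(seen - expected)


# Invented identifiers: tweet "2" is missing and tweet "1" appears twice.
lines = [
    format_prediction("1", ["red_heart", "face_with_tears_of_joy"]),
    format_prediction("1", ["face_with_tears_of_joy"]),
    format_prediction("3", ["red_heart"]),
]
missing, duplicated, unknown = check_results(lines, ["1", "2", "3"])
```

A results file is ready to submit under this check when all three returned lists are empty.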

Only system runs that provide the whole ordered set of predicted emoji labels ("label_1" ... "label_25") will be evaluated with respect to Accuracy@5/10/25 and Coverage Error, in addition to Macro F-score. Please see the System evaluation section of Data and tools for more details about the evaluation of system runs.
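To illustrate why the full ranking is needed for these measures, here is a rough Python sketch. It assumes that Accuracy@k is the fraction of tweets whose correct label appears among the first k predicted labels, and that Coverage Error is the average 1-based position of the correct label in the ranking. These definitions, the function names and the toy labels are assumptions made for illustration; the System evaluation section remains the reference for the official measures.

```python
def accuracy_at_k(gold, ranked, k):
    """Fraction of tweets whose gold label is among the first k ranked labels.

    gold maps tid -> correct label; ranked maps tid -> list of labels, most likely first.
    """
    hits = sum(1 for tid, label in gold.items() if label in ranked.get(tid, [])[:k])
    return hits / len(gold)


def coverage_error(gold, ranked):
    """Average 1-based rank of the gold label.

    Raises KeyError or ValueError if a tweet has no ranking or the ranking
    lacks the gold label, which cannot happen when all labels are ranked.
    """
    ranks = [ranked[tid].index(label) + 1 for tid, label in gold.items()]
    return sum(ranks) / len(ranks)


# Toy example with three invented labels instead of the 25 real ones.
gold = {"1": "a", "2": "b"}
ranked = {"1": ["a", "b", "c"], "2": ["c", "a", "b"]}
acc1 = accuracy_at_k(gold, ranked, 1)
acc3 = accuracy_at_k(gold, ranked, 3)
cov = coverage_error(gold, ranked)
```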

> Test Dataset with ground truth labels (released on 26th September 2018)

On the 26th of September 2018, once all the runs of ITAmoji 2018 had been evaluated, we released the test dataset (25,000 tweets in Italian used to evaluate the emoji prediction systems) with ground truth labels. This dataset can be downloaded as a password-protected ZIP archive from the following link:

https://drive.google.com/open?id=1kbPGdyI3fg6oSQIjx5-91t7ZXfbyxtFu

The Test Dataset is a single text file (UTF-8) with one tweet per line in JSON Object format: the ground truth label of each tweet is provided by means of the property "ground_truth_label" of the related JSON Object.
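With the ground truth labels available, a run can be scored locally. The Python sketch below computes a macro-averaged F-score of the top prediction ("label_1") against "ground_truth_label". It is not the official scorer: it assumes that each ground truth line also carries a "tid" property, that the average is taken over the labels occurring in the ground truth, and that tweets without a prediction count as errors. The sample lines are invented.

```python
import json
from collections import Counter


def macro_f1(gold, predicted):
    """Macro-averaged F1 over the labels in gold; both arguments map tid -> label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for tid, gold_label in gold.items():
        pred_label = predicted.get(tid)
        if pred_label == gold_label:
            tp[gold_label] += 1
        else:
            fn[gold_label] += 1
            if pred_label is not None:
                fp[pred_label] += 1
    labels = set(gold.values())
    # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive for every gold label.
    scores = [2 * tp[l] / (2 * tp[l] + fp[l] + fn[l]) for l in labels]
    return sum(scores) / len(scores)


# Invented ground truth and run lines; "tid" in the ground truth file is an assumption.
truth_lines = [
    '{"tid":"1","ground_truth_label":"red_heart"}',
    '{"tid":"2","ground_truth_label":"red_heart"}',
    '{"tid":"3","ground_truth_label":"face_with_tears_of_joy"}',
]
run_lines = [
    '{"tid":"1","label_1":"red_heart"}',
    '{"tid":"2","label_1":"face_with_tears_of_joy"}',
    '{"tid":"3","label_1":"face_with_tears_of_joy"}',
]
gold = {o["tid"]: o["ground_truth_label"] for o in map(json.loads, truth_lines)}
predicted = {o["tid"]: o["label_1"] for o in map(json.loads, run_lines)}
score = macro_f1(gold, predicted)
```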

Pre-trained Embeddings

Participants are allowed to use additional data to train unsupervised systems (like word embeddings).

Pretrained Twitter embeddings for Italian tweets (with dimension 100 and 300) can be downloaded from https://github.com/fvancesco/acmmm2016