The metric that will be used to evaluate entries is Mean Recall at k, with k=6 (MR@6):
- For each sample, the set of k=6 predicted categories (P) will be compared to the set of gold categories (G). The recall will be computed as follows:

  recall = |P ∩ G| / |G|
In other words, the recall for each sample is the fraction of the gold categories which are correctly predicted.
- Example:
- the gold categories are: ['yes', 'popcorn', 'good_luck']
- the predicted categories are: ['eye_roll', 'facepalm', 'fist_bump', 'good_luck', 'popcorn', 'happy_dance']
- The recall is 2/3 ≈ 0.667, since two of the three gold categories ('good_luck' and 'popcorn') were successfully predicted.
The per-sample recall scores are averaged over all samples to produce the final recall score p, which will be used to rank the submissions.
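To make the computation concrete, here is a minimal sketch of the per-sample recall and its average; the function names are ours for illustration and are not part of the official evaluation script:

```python
def sample_recall(predicted, gold):
    """Fraction of gold categories that appear among the predicted categories."""
    return len(set(predicted) & set(gold)) / len(set(gold))

def mean_recall(pred_lists, gold_lists):
    """Average the per-sample recall over all samples (MR@6 when each prediction lists 6 categories)."""
    scores = [sample_recall(p, g) for p, g in zip(pred_lists, gold_lists)]
    return sum(scores) / len(scores)

# The worked example from above: 2 of the 3 gold categories are predicted.
gold = ['yes', 'popcorn', 'good_luck']
pred = ['eye_roll', 'facepalm', 'fist_bump', 'good_luck', 'popcorn', 'happy_dance']
print(sample_recall(pred, gold))  # 0.666...
```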
In addition to the overall recall p, we also report p1 and p2, which are:
- p1: the MR@6 over all samples whose response tweets include both an animated GIF and text.
- p2: the MR@6 over all samples whose response tweets include only an animated GIF (i.e., "reply" == "").
The metric used for ranking the submissions is p (MR@6 over all samples); p1 and p2 are provided for information purposes only.
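The following sketch shows how p, p1, and p2 could be obtained by splitting the samples on the reply text; the sample layout and the mr_at_6 helper are our own illustration, assuming the "reply" field mentioned above holds the tweet text:

```python
# Hypothetical sample layout: (predicted_categories, gold_categories, reply_text).
samples = [
    (['good_luck', 'popcorn', 'yes', 'hug', 'facepalm', 'eye_roll'], ['yes', 'popcorn'], "congrats!"),
    (['fist_bump', 'happy_dance', 'good_luck', 'yes', 'no', 'hug'], ['fist_bump'], ""),
]

def mr_at_6(subset):
    """Mean Recall over a subset of (predicted, gold, reply) samples."""
    recalls = [len(set(p) & set(g)) / len(set(g)) for p, g, _ in subset]
    return sum(recalls) / len(recalls) if recalls else 0.0

p  = mr_at_6(samples)                             # ranking metric: all samples
p1 = mr_at_6([s for s in samples if s[2] != ""])  # response includes a GIF and text
p2 = mr_at_6([s for s in samples if s[2] == ""])  # response includes only a GIF
print(p, p1, p2)
```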