TAG-it

Topic, Age and Gender prediction for Italian

OVERVIEW

TAG-it is a profiling task for Italian.

It can be seen as a follow-up to the GxG task organised in the context of EVALITA 2018, though with some differences. GxG was concerned with gender prediction and had two distinctive traits: (i) models were trained and tested cross-genre, and (ii) evidence per author was extremely limited for some genres (for Twitter and YouTube, one tweet or one comment). The combination of these two aspects yielded scores that were comparatively lower than those observed in other campaigns and for other languages. One of the core reasons for training the models cross-genre was to remove, as far as possible, not only genre-specific traits but also topic-related features. In most n-gram-based models, which are standard for this task, the two would largely coincide.

For this edition, the task is revised to address these two aspects, with the aim of better disentangling the dimensions we are dealing with. First, only a single genre (blogs) is considered, so that performance can be examined while controlling for other aspects. Second, longer texts are used, which should provide better evidence than single tweets and are more coherent than a mere concatenation of multiple tweets. Third, "topic control" is introduced in order to assess the interaction between topic and lexically rich (n-gram-based) models more directly than in GxG, where this was done only indirectly via cross-genre prediction.
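As a rough illustration of what controlling for topic can mean in practice, the sketch below extracts character n-gram counts and builds a split in which the test topics never occur in training. It is only a minimal sketch under stated assumptions: the record fields (`text`, `topic`, `gender`), the topic names, the toy sentences, and both helper functions are hypothetical and do not reflect the official TAG-it data format or evaluation setup.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def topic_controlled_split(docs, held_out_topics):
    """Split docs so that held-out topics appear only in the test portion."""
    train = [d for d in docs if d["topic"] not in held_out_topics]
    test = [d for d in docs if d["topic"] in held_out_topics]
    return train, test


# Toy, invented records; field names and label values are illustrative only.
docs = [
    {"text": "La partita di ieri", "topic": "sports", "gender": "M"},
    {"text": "Il nuovo motore", "topic": "auto", "gender": "F"},
    {"text": "Il derby di domani", "topic": "sports", "gender": "F"},
]

train, test = topic_controlled_split(docs, held_out_topics={"auto"})

# Lexical features of the kind an n-gram-based model would rely on.
train_features = [char_ngrams(d["text"]) for d in train]
```

Because an n-gram model picks up topical vocabulary along with any author-related signal, evaluating on topics unseen in training gives a more direct view of how much of the performance depends on topic.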

Data was collected accordingly, including information on topic and on two author profiling dimensions: gender and age. The interesting aspect is that the task mixes text profiling (topic) and author profiling (gender and age) dimensions, which can be addressed separately or all at once.