Given a (collection of) text(s) from a specific genre, the gender of the author has to be predicted. The task is cast as a binary classification task, with gender represented as F (female) or M (male). Gender prediction will be done in two ways: in-genre, with models trained and tested on the same genre, and cross-genre, with models tested on a genre they have not been trained on.
A crucial aspect of this task is the design of the evaluation settings, as they are key to shedding light on the core question: are there indicative traits across genres that can be leveraged to model gender in a largely genre-independent way?
This question will be answered by having participants train and test their models on datasets from different genres. For comparison, participants will also submit genre-specific models that will be tested on the very same genre they have been trained on. In-genre modelling will (i) shed light on which genres might be easier to model, i.e. where gender traits are more prominent; and (ii) make it easier to quantify the loss incurred when modelling gender across genres.
More specifically, participants will be asked to submit up to ten different models: five in-genre models (one per genre, trained and tested on the same genre) and five cross-genre models (one per genre, trained without any data from the test genre).
Naturally, participants who believe they have a single model that works for everything can submit the same model for all settings. In the cross-genre setting, the only constraint is that no instance from the test genre may be used in training; beyond that, participants are free to combine the remaining datasets as they wish.
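As an illustration of the cross-genre constraint, the following minimal sketch pools all genres except the test genre into a single training set; the genre names and data layout are invented for the example and do not reflect the actual task data.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (document text, gender label "F" or "M")

def cross_genre_training_set(corpora: Dict[str, List[Example]],
                             test_genre: str) -> List[Example]:
    """Pool training data from every genre except the one reserved for testing."""
    return [ex for genre, examples in corpora.items()
            if genre != test_genre
            for ex in examples]

# Invented genre names, purely for illustration:
corpora = {
    "genre_A": [("a text from genre A", "F")],
    "genre_B": [("a text from genre B", "M")],
    "genre_C": [("a text from genre C", "F")],
}
train = cross_genre_training_set(corpora, test_genre="genre_A")  # no genre_A data in training
```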
Participants are also free to use external resources as they wish, provided the cross-genre settings are carefully preserved, and everything used is described in detail.
Evaluation. As this is a binary classification task with balanced data, we will evaluate performance using accuracy, as is standard in author profiling.
For each of the ten models, five in the in-genre setting and five in the cross-genre setting, we will calculate the accuracy for each of the two classes, i.e. F and M, and average them. Since both classes are equally important and balanced in size, macro-averaged accuracy is the appropriate measure.
To derive two final scores, one for the in-genre and one for the cross-genre setting, we will simply average the five per-genre accuracies.
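As a concrete illustration, the scoring described above could be computed along the lines of the following sketch; the function names and data layout are our own and purely illustrative.

```python
from typing import Dict, List, Tuple

def macro_accuracy(y_true: List[str], y_pred: List[str],
                   labels=("F", "M")) -> float:
    """Mean of the per-class accuracies for F and M."""
    per_class = []
    for label in labels:
        indices = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in indices if y_pred[i] == label)
        per_class.append(correct / len(indices))
    return sum(per_class) / len(per_class)

def final_score(per_genre: Dict[str, Tuple[List[str], List[str]]]) -> float:
    """Average the per-genre macro accuracies into one final score
    (computed separately for the in-genre and cross-genre settings)."""
    return sum(macro_accuracy(t, p) for t, p in per_genre.values()) / len(per_genre)
```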
We will keep the two rankings separate. For determining the official “winner”, we will use the cross-genre ranking.
Baselines. For all settings, since the datasets are balanced for gender, random assignment yields a 50% baseline. Comparable sizes will be ensured across the test sets and aimed for in the training sets.
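For reference, the random baseline simply assigns F or M uniformly at random, as in this minimal sketch (the function name and seed are our own choices):

```python
import random

def random_baseline(n_instances: int, seed: int = 42) -> list:
    """Assign F or M uniformly at random; on gender-balanced data the
    expected accuracy is 50%."""
    rng = random.Random(seed)
    return [rng.choice(["F", "M"]) for _ in range(n_instances)]
```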