The three core dimensions of our dataset are Gender, Age and Topic. These can give rise to interesting interactions, which we aim to capture through the following subtasks that will be proposed to the participants:
SUBTASK 1 (Predict all dimensions at once)
Given a collection of blog texts written by the same author, participants must predict the author's gender, the author's age and the topic the author writes about. The task is cast as a multi-label classification task, with gender represented as F (female) or M (male), age as an age range (e.g., 30-39) and topic as a string label.
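As an illustration of the label format described above, a single prediction for one author could be represented as follows. This is only a minimal sketch: the class name `AuthorPrediction`, its field names and the example topic label are hypothetical and are not part of any official submission format.

```python
from dataclasses import dataclass

# Gender codes as stated in the task description.
GENDERS = {"F", "M"}


@dataclass(frozen=True)
class AuthorPrediction:
    """One prediction for one author (hypothetical container, not an official format)."""
    gender: str      # "F" or "M"
    age_range: str   # an age range such as "30-39"
    topic: str       # free string label

    def __post_init__(self):
        if self.gender not in GENDERS:
            raise ValueError(f"gender must be one of {sorted(GENDERS)}")
        # An age range is assumed to look like "<low>-<high>" with low <= high.
        low, sep, high = self.age_range.partition("-")
        if not (sep and low.isdigit() and high.isdigit() and int(low) <= int(high)):
            raise ValueError(f"malformed age range: {self.age_range!r}")
        if not self.topic:
            raise ValueError("topic label must be a non-empty string")


# The topic label "sports" is made up for the example.
example = AuthorPrediction(gender="F", age_range="30-39", topic="sports")
```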
SUBTASK 2 (Predict age and gender with topic control)
Subtask 2a: predict gender, controlling for topic (two settings: one where all texts come from the same topic, and one where the topic is random)
Subtask 2b: predict age, controlling for topic (two settings: one where all texts come from the same topic, and one where the topic is random)
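The two topic-control settings can be pictured as two ways of selecting texts from a labelled pool. The sketch below is one possible reading, not the organisers' actual procedure: the function names, the dictionary keys and the toy topic labels are all assumptions made for the example.

```python
import random


def same_topic_setting(docs, topic):
    """Keep only the texts that carry the given topic label."""
    return [d for d in docs if d["topic"] == topic]


def random_topic_setting(docs, size, seed=0):
    """Draw texts without looking at their topic, so topics end up mixed."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    return rng.sample(docs, size)


# Toy pool; texts, topics and labels are invented for illustration.
pool = [
    {"text": "t1", "topic": "travel", "gender": "F", "age": "30-39"},
    {"text": "t2", "topic": "travel", "gender": "M", "age": "20-29"},
    {"text": "t3", "topic": "food", "gender": "F", "age": "40-49"},
    {"text": "t4", "topic": "sports", "gender": "M", "age": "30-39"},
]

controlled = same_topic_setting(pool, "travel")
mixed = random_topic_setting(pool, size=3)
```

In the first setting topic is held constant, so a classifier cannot rely on it as a proxy for gender or age; in the second, topic varies freely.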
External Resources. Participants are also free to use external resources as they wish, provided that the cross-genre settings are carefully preserved and that everything used is described in detail.
Baseline. For all tasks, given that dataset labels are unbalanced, we will use a majority baseline.
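A majority baseline always predicts the label that is most frequent in the training data. A minimal sketch for a single dimension follows; the function names and the toy labels are ours, and scoring by accuracy is an assumption, since the evaluation measure is not specified here.

```python
from collections import Counter


def majority_label(train_labels):
    """Return the most frequent label in the training data."""
    label, _count = Counter(train_labels).most_common(1)[0]
    return label


def accuracy(gold, predicted):
    """Fraction of positions where the prediction matches the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy gender labels, invented for the example.
train = ["F", "F", "M"]
test_gold = ["F", "M", "F", "F"]

baseline = majority_label(train)              # "F"
predictions = [baseline] * len(test_gold)     # same label for every author
score = accuracy(test_gold, predictions)      # 3 of 4 correct
```

With unbalanced labels this baseline can score well above chance, which is why it serves as a meaningful lower bound for submitted systems.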