We devise a standard track and a special track. For the standard track, participants will be provided with data at the country level, whereas for the special track, participants will be provided with data restricted to an area of choice (subject to data availability and representativeness).
In both tracks, two subtasks are envisioned: coarse-grained geolocation (subtask A) and fine-grained geolocation (subtask B). Subtask A is the simpler of the two from a technical point of view: the goal is to predict a region for each post from a fixed set of candidates. Subtask B is more challenging, since it requires predicting longitude and latitude coordinates, but it has the potential to uncover fine-grained linguistic variation. It also overcomes a simplification inherent in subtask A, namely that language use lies on a continuum and may cross administrative borders.
Participants can decide to participate in one or more tracks and subtasks, and can specify their preference after registration.
Given the text of a tweet exhibiting non-standard Italian language, predict its region of provenance. This is a classification task, i.e., a region of Italy needs to be predicted.
Evaluation
Systems will be evaluated using macro Precision, Recall, and F1 score on a subset of the regions of Italy (i.e., 13 known, 1<=k<=7 unknown during development), and ranked by macro F1 score (the higher the better).
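As an illustration of the metric, the following is a minimal sketch of macro-averaged precision, recall, and F1. It is not the official scorer. It assumes that averaging runs over the regions that appear in the gold labels, that a class with an empty denominator scores 0, and that macro F1 is the mean of the per-region F1 scores; the function name `macro_prf` is hypothetical.

```python
from collections import Counter


def macro_prf(gold, pred):
    """Return macro-averaged (precision, recall, F1).

    Assumptions (not taken from the official scorer): averaging runs over
    the labels found in `gold`, and undefined ratios count as 0.0.
    """
    labels = sorted(set(gold))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted region p, but it was wrong
            fn[g] += 1  # gold region g was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with three regions.
gold = ["Lazio", "Lazio", "Campania", "Veneto"]
pred = ["Lazio", "Campania", "Campania", "Lazio"]
precision, recall, f1 = macro_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because every region contributes equally to the average, a system that ignores infrequent regions is penalized even if its overall accuracy is high.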
Given the text of a tweet exhibiting non-standard Italian language, predict its location in terms of longitude and latitude coordinates. This is a (double) regression task, i.e., a pair of real-valued numbers needs to be predicted.
Evaluation
Systems will be evaluated using mean distance in km of predicted coordinates from actual coordinates (the lower the better) on a subset of the regions of Italy (i.e., 13 known, 1<=k<=7 unknown during development).
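The distance-based metric can be sketched as follows. This is an illustrative sketch rather than the official scorer: it assumes great-circle (haversine) distance on a sphere with a mean Earth radius of 6371 km, and coordinates given as (latitude, longitude) pairs in decimal degrees. The names `haversine_km` and `mean_distance_km` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mean_distance_km(gold, pred):
    """Mean distance in km between gold and predicted (lat, lon) pairs."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    total = sum(haversine_km(g[0], g[1], p[0], p[1])
                for g, p in zip(gold, pred))
    return total / len(gold)


# Toy example: one exact prediction and one that puts Rome in Milan.
gold = [(41.9028, 12.4964), (41.9028, 12.4964)]
pred = [(41.9028, 12.4964), (45.4642, 9.1900)]
print(f"{mean_distance_km(gold, pred):.1f} km")
```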
The special track consists of the same subtasks and evaluation protocol as the standard track, but the focus will be on a subset of the data representing an area chosen by the participants (subject to data availability and representativeness). This means that the training, development, and test sets will all represent that particular area, and that proposed solutions will be ranked separately for each area.
An area can be a region (e.g., Campania) or a set of regions (areas that are relevant in terms of linguistic variation). In the case of a single region, only subtask B will be possible.
This track allows interested participants to make use of local knowledge of variants, dialectal terms, and regional forms to study geolocation of linguistic variation and ultimately uncover little-known linguistic patterns within specific areas.
Baseline methods have been provided to participants along with training and development data:
Coarse-grained geolocation (subtask A): a most-frequent baseline, which always predicts the most frequent region in the training set (i.e., Lazio) for all validation instances; and a logistic regression baseline.
Fine-grained geolocation (subtask B): a centroid baseline, which computes the center point (latitude, longitude) of the training set and predicts it for all test instances; and a k-nearest neighbors baseline.
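The two simplest baselines above can be sketched in a few lines. This is an illustrative sketch and not the code distributed to participants: it assumes that the centroid is the arithmetic mean of the training latitudes and longitudes, and the function names are hypothetical. The logistic regression and k-nearest neighbors baselines are not sketched here, since their features and hyperparameters are not described in this text.

```python
from collections import Counter


def most_frequent_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    top_label = Counter(train_labels).most_common(1)[0][0]

    def predict(texts):
        # The input texts are ignored: every instance gets the same region.
        return [top_label for _ in texts]

    return predict


def centroid_baseline(train_coords):
    """Return a predictor that always outputs the training centroid.

    Assumption: the centroid is the arithmetic mean of the (lat, lon) pairs.
    """
    n = len(train_coords)
    lat = sum(c[0] for c in train_coords) / n
    lon = sum(c[1] for c in train_coords) / n

    def predict(texts):
        return [(lat, lon) for _ in texts]

    return predict


# Toy usage with made-up training data.
predict_region = most_frequent_baseline(["Lazio", "Lazio", "Veneto"])
predict_coords = centroid_baseline([(40.0, 10.0), (42.0, 14.0)])
print(predict_region(["tweet one", "tweet two"]))
print(predict_coords(["tweet one"]))
```

Both predictors ignore the tweet text entirely, so they serve as a lower bound that any text-based system should beat.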
You can find the evaluation scorer and baselines' scores in the GeoLingIt repository on GitHub.