GeoLingIt
Geolocation of Linguistic Variation in Italy

Shared task at EVALITA 2023

Introduction

GeoLingIt is the first shared task on geolocation of linguistic variation in Italy from social media posts exhibiting non-standard Italian language. It is part of the EVALITA 2023 evaluation campaign, whose workshop will be held in Parma (Italy) on September 7-8, 2023. 

The task is meant to both advance natural language processing (NLP) in dealing with non-standard Italian language, and inform sociolinguistics with language variation insights derived from large-scale, quantitative analysis (e.g., to enrich and complement linguistic atlases).

GeoLingIt is open to everyone and we strongly encourage linguists to participate!

News

Sep 9th, 2023 GeoLingIt has been a success! Many thanks to all participants and EVALITA organizers!
Aug 29th, 2023 Both the oral and the poster sessions will be on Sep 8th, 2023 (full program here)
May 7th, 2023 Test data is available to participants! The evaluation window will remain open until May 17th 14th, 2023
Feb 7th, 2023 GeoLingIt begins! The data and the evaluation script are available to participants.
Nov 10th, 2022 The website for the GeoLingIt shared task is online; you can already register here!

Motivation

The ever-growing number of people who interact on social media opens opportunities to study language use across several sociolinguistic dimensions. By nature, user-generated texts on social media are informal and feature linguistic patterns from spoken language.

Among the dimensions of variation in language, diatopic variation (i.e., variation across space) is one of the dominant ones, especially when it comes to Italy. Indeed, "there is probably no other area in Europe in which such a profusion of linguistic variation is concentrated into so small a geographical area" (Maiden & Parry, 1997). As one of the most linguistically diverse landscapes in Europe, Italy comprises a large number of local languages, dialects, and regional varieties of Standard Italian (i.e., regional Italian) (Ramponi, 2022). Although Italian speakers try to adhere to a standard language (i.e., Standard Italian) when addressing the public, in online fora (e.g., Twitter) they often employ words, constructions, or clauses in their own local language, dialect, or regional variety as a way to signal their social identities.

This makes the study of variation in Italy compelling from multiple perspectives, from computational linguistics – which can leverage social media data to discover regional and dialectal patterns and ultimately improve language technologies for Italian and minority languages – to sociolinguistics – which can be informed by large-scale, quantitative analyses to enrich and complement current linguistic atlases.

Task overview

The GeoLingIt shared task aims to advance current knowledge of linguistic variation in Italy through the design and evaluation of methods that predict the location (coordinates or coarse areas) of social media posts from Twitter based solely on their linguistic content.

In contrast to previous geolocation shared tasks on other areas (Han et al., 2016; Gaman et al., 2020; Chakravarthi et al., 2021), GeoLingIt is focused on Italy, and each post is filtered to contain non-standard Italian language, so that methods and findings center on non-standard Italian language variation rather than on highly localized lexical items (e.g., mentions of events, places, or tourist attractions).

Linguistic variation in GeoLingIt data can manifest as just single words or phrases (i.e., items in a local language or dialect, or regional synonyms, e.g., guaglione, toso, picciotto for "young man"), as code-switching (i.e., alternation of Standard Italian and a local language, dialect, or regional variety), or as entire posts written in a specific local language or dialect.

Learn more about the task (with associated tracks and subtasks), the data (with examples), and how to participate!

Important dates