GeoLingIt data comprises geotagged social media posts from Twitter (tweets) that exhibit non-standard Italian language use. Each post (classified by Twitter as Italian, language code "it") has associated latitude, longitude, and place name information that falls within Italian territory.
As such, the data is entirely focused on Italy and comprises language varieties other than Standard Italian, so methods and findings will center on non-standard Italian language variation rather than on highly localized lexical items (e.g., mentions of events, places, or tourist attractions).
Linguistic variation in GeoLingIt data can manifest as single words or phrases (i.e., items in a local language, dialect, or regional synonyms – e.g., guaglione, toso, picciotto for "young man"), as code-switching (i.e., alternation of Standard Italian and a local language, dialect, or regional variant), or as entire posts written in a specific local language or dialect.
The dataset is in a tab-separated format, with an example per line and the first line as header.
Subtask A. Each example has three columns:
id: the tweet identifier (anonymized to preserve the user’s anonymity)
text: the text of the tweet (with anonymized user mentions, email addresses, URLs, and location mentions)
region: the region of the tweet in a string format (label for subtask A)
Subtask B. Each example has four columns:
id: the tweet identifier (anonymized to preserve the user’s anonymity)
text: the text of the tweet (with anonymized user mentions, email addresses, URLs, and location mentions)
latitude: the latitude coordinates as a floating point number (label for subtask B)
longitude: the longitude coordinates as a floating point number (label for subtask B)
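As a rough illustration of the format described above, the following Python sketch reads a tab-separated file with a header line and checks that the columns match the chosen subtask. The function name `read_examples`, the `SUBTASK_COLUMNS` mapping, and the sample rows are hypothetical and are not part of the official data release. Disabling quote handling is also an assumption, made so that quotation marks inside tweets are kept literally.

```python
import csv
import io

# Column layout per subtask, as listed in the data description.
SUBTASK_COLUMNS = {
    "A": ["id", "text", "region"],
    "B": ["id", "text", "latitude", "longitude"],
}


def read_examples(handle, subtask):
    """Read tab-separated examples and validate the header for a subtask.

    `handle` is any text file object; `subtask` is "A" or "B".
    """
    # QUOTE_NONE keeps quotation marks in tweets as ordinary characters
    # (an assumption about how the files are written).
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    expected = SUBTASK_COLUMNS[subtask]
    if reader.fieldnames != expected:
        raise ValueError(f"expected columns {expected}, got {reader.fieldnames}")

    examples = []
    for row in reader:
        if subtask == "B":
            # Coordinates are floating point numbers in subtask B.
            row["latitude"] = float(row["latitude"])
            row["longitude"] = float(row["longitude"])
        examples.append(row)
    return examples


# Made-up rows for illustration only; they are not taken from the dataset.
sample_a = "id\ttext\tregion\n1\tesempio di testo\tLazio\n"
sample_b = "id\ttext\tlatitude\tlongitude\n1\tesempio di testo\t41.9\t12.5\n"

examples_a = read_examples(io.StringIO(sample_a), "A")
examples_b = read_examples(io.StringIO(sample_b), "B")
```

With a real file, the same function could be called on a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the downloaded file is stored.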
The GeoLingIt data is divided into three sets that serve the following purposes:
training set: released on Feb 7, 2023 (see important dates) for designing your solution(s) / training your model(s)
development set: released on Feb 7, 2023 (see important dates) along with a scorer for assessing the performance of your solution(s)
test set: released on May 7, 2023 (see important dates) without labels. You will have to return predictions to us for the final evaluation, in the same format described in the previous section.
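Since predictions are to be returned in the same tab-separated format, one possible way to serialize them is sketched below in Python. The helper `format_predictions` and the sample values are hypothetical. The exact submission requirements (for example, whether the text column must be repeated, or how files should be named) are not specified here, so this sketch simply mirrors the column layout given for each subtask.

```python
def format_predictions(examples, predictions, label_columns):
    """Build tab-separated output: a header line, then one example per line.

    `examples` are dicts with "id" and "text"; `predictions` are dicts
    holding one value for each name in `label_columns`.
    """
    if len(examples) != len(predictions):
        raise ValueError("one prediction per example is required")

    lines = ["\t".join(["id", "text", *label_columns])]
    for example, prediction in zip(examples, predictions):
        fields = [example["id"], example["text"]]
        fields += [str(prediction[column]) for column in label_columns]
        # A tab or newline inside a field would break the one-line-per-example layout.
        if any("\t" in field or "\n" in field for field in fields):
            raise ValueError("fields must not contain tabs or newlines")
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# Made-up test example and predictions, for illustration only.
test_examples = [{"id": "1", "text": "esempio di testo"}]

output_a = format_predictions(test_examples, [{"region": "Lazio"}], ["region"])
output_b = format_predictions(
    test_examples,
    [{"latitude": 41.9, "longitude": 12.5}],
    ["latitude", "longitude"],
)
```

The returned string can then be written to a file opened with UTF-8 encoding. Building the lines by hand rather than with a CSV writer is a deliberate choice here, so that no quoting or escaping is applied to the tweet text.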
The dataset is meant to study how language use varies across space. As such, information about individual users is neither distributed nor may it be used by participants. Additionally, user mentions, email addresses and URLs within the text of posts have been anonymized with placeholders. Latitude and longitude coordinates do not correspond to specific places within cities, but instead represent cities as a whole (i.e., posts within the same city will have the same coordinates). Results are meant to be used only in aggregate form to study diatopic linguistic variation. The data is licensed under the CC BY-NC-SA 4.0 license (full text available here) and complies with the Twitter developer policy.
⚠ Since the dataset consists of social media posts, it contains some profanities, slurs, and hateful content.