Workshop Program

The workshop takes place on Sunday 16 October. All times are Korea time.

This year, VarDial does not feature a poster session. Instead, we have short (10+5 minutes) and long (20+10 minutes) oral presentations.

10:00–10:10 - Opening Session

10:10–10:30 - Findings of the VarDial Evaluation Campaign 2022 - Noëmi Aepli, Antonios Anastasopoulos, Adrian Chifu, William Domingues, Fahim Faisal, Mihaela Gaman, Radu Tudor Ionescu and Yves Scherrer

10:30–11:00 - Shared task system description papers I

  • 10:30–10:45 - Transfer Learning Improves French Cross-Domain Dialect Identification: NRC @ VarDial 2022 - Gabriel Bernier-Colborne, Serge Leger and Cyril Goutte

  • 10:45–11:00 - Is Encoder-Decoder Transformer the Shiny Hammer? - Nat Gillin

11:00–11:30 - Coffee break

11:30–12:15 - Shared task system description papers II

  • 11:30–11:45 - Italian Language and Dialect Identification and Regional French Variety Detection using Adaptive Naive Bayes - Tommi Jauhiainen, Heidi Jauhiainen and Krister Lindén

  • 11:45–12:00 - The Curious Case of Logistic Regression for Italian Languages and Dialects Identification - Giacomo Camposampiero, Quynh Anh Nguyen and Francesco Di Stefano

  • 12:00–12:15 - Neural Networks for Cross-domain Language Identification. Phlyers @Vardial 2022 - Andrea Ceolin

12:15–13:15 - Invited Talk by Tanja Samardžić (University of Zurich): Data-centric vs. model-centric solutions for dialect identification

Abstract: The VarDial evaluation campaigns have provided crucial insights and data sets for studying dialect identification as one of the basic tasks in multilingual NLP. This task is an especially challenging case of text encoding because the classification does not rely on an abstract semantic representation of the whole sentence (as in usual text classification), but on surface features of the text, such as distinctive suffixes or prefixes of words, phonetic clusters and the order of tokens. These features show up occasionally in the text, which otherwise might look the same in two different dialects. In this talk, I will present the results of several experiments performed on the VarDial data with the goal of understanding the interplay between the properties of data sets and the models used for solving the task. We focus specifically on two data properties: the length of the input strings and the size of the subword vocabulary. Regarding the models, we compare small-size models, such as CNNs trained from scratch, with large pretrained BERT models. The results of our experiments point to a potential constant input length (normalised across data sets) that can be used for early dialect identification. In addition to this, we find an interesting interaction between the size of subword vocabulary, the size of models and the performance scores.

13:15–14:30 - Lunch break

14:30–16:00 - Research papers I

  • 14:30–15:00 - Phonetic, Semantic, and Articulatory Features in Assamese-Bengali Cognate Detection - Abhijnan Nath, Rahul Ghosh and Nikhil Krishnaswamy

  • 15:00–15:15 - Annotating Norwegian language varieties on Twitter for Part-of-speech - Petter Mæhlum, Andre Kåsen, Samia Touileb and Jeremy Barnes

  • 15:15–15:45 - OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan - Aleksandra Miletic and Yves Scherrer

  • 15:45–16:00 - Low-Resource Neural Machine Translation: A Case Study of Cantonese - Evelyn Kai-Yan Liu

16:00–16:30 - Coffee break

16:30–17:45 - Research papers II

  • 16:30–17:00 - Mapping Phonology to Semantics: A Computational Model of Cross-Lingual Spoken-Word Recognition - Iuliia Zaitova, Badr Abdullah and Dietrich Klakow

  • 17:00–17:15 - Social context and user profiles of linguistic variation on a micro scale - Olga Kellert and Nicholas Hill Matlis

  • 17:15–17:45 - dialectR: Doing Dialectometry in R - Ryan Soh-Eun Shim and John Nerbonne

17:45–18:45 - Invited Talk by Dong Nguyen (Utrecht University): Towards representing linguistic variation and social meaning: Challenges and first steps

Abstract: There are often various ways to express the same thing. Think of, for example, the different words we can use for a given concept, or the many creative spellings in social media. Language variation is sometimes seen as a challenge for learning representations in NLP, but in this talk, I will discuss how language variation is also an opportunity: it can help us develop representations that are more sensitive to social context. First I will talk about spelling variation: What social meanings are associated with various types of spelling variation? And do word embedding algorithms encode variation patterns? Next, I will focus on stylistic variation. I will present a new similarity-based benchmark for style and first steps in learning style representations.

18:45–19:00 - Closing Remarks