Invited Speakers

Sebastian Ruder, Google Research

Sebastian Ruder is a research scientist at Google Research, based in Berlin, where he works on NLP for under-represented languages. He was previously a research scientist at DeepMind, London. He completed his Ph.D. in Natural Language Processing at the Insight Research Centre for Data Analytics while working as a research scientist at AYLIEN, a Dublin-based text analytics startup. Before that, he studied Computational Linguistics at the University of Heidelberg, Germany, and at Trinity College, Dublin. He is interested in cross-lingual learning and transfer learning for NLP, and in making ML and NLP more accessible.

Efficient Methods for Low-resource NLP

Virtual, 9:00–9:45, Venue: Columbia D

Low-resource NLP settings are not only limited by a scarcity of data but often also pose constraints on the available compute. In this talk, I will discuss approaches to make models more efficient in terms of samples, space, and time. In low-resource settings, inductive biases and alternative forms of data are particularly important. I will highlight inductive biases such as morphology and segmentation and data sources such as lexicons that have been found useful for low-resource NLP, in particular in multilingual settings. To enable more space-efficient methods, I will discuss approaches that allocate additional language-specific capacity. Finally, I will discuss character-based approaches that enable faster training and inference through efficient down-sampling strategies. I hope this talk will encourage more researchers to consider efficiency as an evaluation criterion and to develop methods that efficiently scale to low-resource settings.
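
The abstract does not say how the language-specific capacity is allocated; one common instantiation in multilingual NLP is a small bottleneck adapter inserted into each transformer layer and trained per language while the shared backbone stays frozen. The sketch below is purely illustrative (the module sizes and language codes are assumptions, not the speaker's method):

```python
# Illustrative sketch only: one way to allocate language-specific capacity
# is a bottleneck adapter per language on top of a frozen multilingual model.
import torch
import torch.nn as nn


class LanguageAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, with a residual connection."""

    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Only the adapter parameters are trained for each language; the
        # shared multilingual backbone stays frozen.
        return hidden_states + self.up(self.act(self.down(hidden_states)))


# One adapter per language costs a few hundred thousand parameters instead of
# a full per-language copy of the model (language codes here are examples).
adapters = nn.ModuleDict({lang: LanguageAdapter() for lang in ["sw", "yo", "ha"]})
hidden = torch.randn(2, 16, 768)  # (batch, sequence length, hidden size)
out = adapters["sw"](hidden)
```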

David Ifeoluwa Adelani, Saarland University

David Ifeoluwa Adelani is a doctoral student in computer science at Saarland University, Saarbrücken, Germany, and an active member of Masakhane NLP, a grassroots organization whose mission is to strengthen and spur NLP research in African languages, for Africans, by Africans. His current research focuses on NLP for African languages, multilingual representation learning, and privacy in NLP.

Multilingual Language Model Adaptive Fine-Tuning: A Study on African Languages

In-person, 10:30–11:15, Venue: Columbia D

Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT): fine-tuning a multilingual PLM on monolingual text of the language using the pre-training objective. However, adapting to each target language individually takes large disk space and limits the cross-lingual transfer abilities of the resulting models, because they have been specialized for a single language. In this work, we perform multilingual adaptive fine-tuning (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent to encourage cross-lingual transfer learning. To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT, thus reducing the model size by around 50%. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive with applying LAFT on individual languages while requiring significantly less disk space. Additionally, we show that our adapted PLM also improves the zero-shot cross-lingual transfer abilities of parameter-efficient fine-tuning methods.
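
As a rough illustration of the vocabulary-trimming step described above (not the authors' released code), the sketch below keeps only the embedding rows for subword tokens that actually occur in a sample of target-language text. The corpus path is a placeholder, and a complete implementation would also rebuild the tokenizer and the tied output embeddings so that old token ids map onto the smaller vocabulary:

```python
# Minimal sketch of embedding-vocabulary trimming before MAFT (assumptions:
# the checkpoint name and the corpus file below are placeholders).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# 1. Collect the subword ids that appear in African-language text,
#    always keeping the special tokens.
keep_ids = set(tokenizer.all_special_ids)
with open("african_corpus_sample.txt", encoding="utf-8") as f:  # placeholder path
    for line in f:
        keep_ids.update(tokenizer(line.strip(), add_special_tokens=False)["input_ids"])
keep_ids = sorted(keep_ids)

# 2. Slice the input embedding matrix down to the kept rows; tokens written in
#    scripts never seen in the corpus are dropped, shrinking the model.
old_embeddings = model.get_input_embeddings().weight.data
new_embeddings = old_embeddings[torch.tensor(keep_ids)]
print(f"vocabulary: {old_embeddings.size(0)} -> {new_embeddings.size(0)} tokens")
```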


Yulia Tsvetkov, University of Washington

Yulia Tsvetkov is an assistant professor at the Paul G. Allen School of Computer Science & Engineering at the University of Washington. Her research group works on computational ethics, multilingual NLP, and machine learning for NLP. This research is motivated by a unified goal: to extend the capabilities of human language technology beyond individual populations and across language boundaries, thereby making it available to all users. Prior to joining UW, Yulia was an assistant professor at Carnegie Mellon University and a postdoc at Stanford. Yulia is a recipient of an NSF CAREER award, a Sloan Fellowship, Google Faculty and Amazon Machine Learning Research awards, and an Okawa Research award.

Interpretation as Weak Supervision for Low Resource NLP

In-person, 13:30–14:15, Venue: Columbia D

Deep learning is typically associated with an abundance of data. But there are scenarios where pre-collected data will never be enough. For example, language on social media is constantly evolving, and pretrained language models cannot adapt to rapid language change, dialects, and sociolects, no matter how large the pretraining or annotated datasets are. Other examples of constantly evolving, and therefore always low-resource, language domains include scientific articles, expert notes, and even news. In this talk, I will advocate for using model interpretability methods to dynamically procure data annotations in such low-resource scenarios. In the first part, I will show how instance attribution approaches to model interpretability can identify critical training examples to improve the robustness and adaptability of hate speech classifiers. In the second part, I'll show how self-explaining models can be used for entity and keyphrase extraction in scientific articles. I'll conclude with more ideas for this new paradigm of using neural network interpretation as an intrinsic component of low-resource NLP systems, and not only as a tool for presenting explanations to humans.
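
The abstract does not name a specific instance-attribution method; one simple gradient-similarity variant, shown below purely as an illustration with a placeholder model and random data, scores each training example by the dot product between its loss gradient and a test example's loss gradient (higher score = more influential on that prediction):

```python
# Illustrative sketch of gradient-similarity instance attribution
# (placeholder model and data; not the speaker's implementation).
import torch
import torch.nn as nn

model = nn.Linear(100, 2)          # stand-in for a text classifier
loss_fn = nn.CrossEntropyLoss()


def loss_grad(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Flattened gradient of the loss w.r.t. model parameters for one example."""
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.flatten() for g in grads])


# Score training examples against a single test prediction.
x_test, y_test = torch.randn(100), torch.tensor(1)
g_test = loss_grad(x_test, y_test)

train = [(torch.randn(100), torch.tensor(i % 2)) for i in range(8)]
scores = [torch.dot(loss_grad(x, y), g_test).item() for x, y in train]
most_influential = max(range(len(train)), key=lambda i: scores[i])
```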

Graham Neubig, Carnegie Mellon University

Graham Neubig is an associate professor at the Language Technologies Institute of Carnegie Mellon University. His research focuses on multilingual natural language processing, natural language interfaces to computers, and machine learning methods for NLP, with the final goal of enabling every person in the world to communicate with each other, and with computers, in their own language. He also contributes to making NLP research more accessible through open publishing of research papers, advanced NLP course materials and video lectures, and open-source software, all of which are available on his website.

Can We Automatically Create Language-Learning Textbooks?

In-person, 15:30–16:15, Venue: Columbia D

While around half of the world's languages are in danger of becoming extinct, there has also been a surge of interest in learning these languages, often by younger people who grew up speaking a different (usually colonial) language. However, due to years of stigmatization or underprioritization, there are often not sufficient language-learning materials to aid these potential learners. In this talk, I will discuss what I think is a grand challenge in low-resource natural language processing: automatically creating materials that help interested language learners learn the language of their choice. I will introduce one of our efforts in this direction: AutoLex (https://autolex.co), a tool that uses syntactic and semantic analysis techniques to analyze text in a low-resource language and then aggregates these analyses into teachable grammar points that help explain how the language works as a whole. I will also discuss a study that we performed together with teachers of Kannada and Marathi, two Indian languages with fewer available resources for second-language learning, demonstrating the utility of this method in an actual curriculum-design context. All in all, I hope this talk will encourage others to think of aiding human language learning as an important application area of low-resource NLP, and to pursue more research on the fundamental building blocks it requires, such as dependency parsing and word-sense disambiguation.