Multilingual Representation Learning
Topics in CS (Applications): COMP 598-002
Instructor: David I. Adelani, Assistant Professor, McGill University & Mila
Contact: david.adelani@mcgill.ca
Term: Winter 2025
When: Tuesdays & Thursdays, 10:05am-11:25am
Where: McConnell Engineering Building 103
Course Description
This seminar-style course focuses on advances in multilingual representation learning and on scaling language technologies to many of the world's languages, including high-resource languages (e.g., English and French), mid-resource languages (e.g., Indonesian and Swahili), and low-resource languages (e.g., Wolof and Quechua), as well as some multimodal applications to images and speech. In the first four lectures, I will provide an overview of multilingual NLP, text embedding models, cross-lingual transfer learning, and open problems in NLP.
Prerequisites:
One of the following McGill courses: Natural Language Processing (COMP 550), Natural Language Understanding with Deep Learning (COMP 545), Applied Machine Learning (COMP 551), or a relevant NLP course at another university. If you are unsure, email me.
Guest lectures:
Julia Kreutzer (Cohere for AI)
Colin Cherry (Google Translate)
Min Ma (Google DeepMind)
Grading (tentative):
This is a demanding course in terms of participation and projects. All deadlines start after the Add/Drop deadline (Tuesday, January 14, 2025).
● Reading and reviewing papers (20%): You are expected to submit a technical review (conference-style) for one of the papers prior to each class on MyCourses. You will submit 8 such reviews.
● Presenting papers in class (20%): After the first few lectures, each student will form a group of two for a joint presentation, on a topic different from the one they wrote a technical review about.
● Leading paper discussions in class (10%): Sign up as a panelist at least twice to critically analyze the presented papers.
● In-class project proposal presentation (5%): To get quick feedback on your proposed project.
● Project (40%): You will complete a project in groups of two. This involves:
○ Literature review (10%)
○ Baselines (5%)
○ Final paper with new experiments and code submission (15%)
○ Final presentation at the end of term (10%)
● Class participation (5%): How engaged you are in the lectures (asking questions during lectures, coming to office hours, etc.).
Topics of interest
Text embeddings: Word embeddings, Sentence embeddings, Text retrieval, Text embedding benchmarks
Cross-lingual transfer: Multilingual transformer models, Transfer learning, Parameter-efficient fine-tuning, Tokenization issues
Bitext mining: Large scale parallel data for machine translation, Contrastive Learning
Neural machine translation: LSTMs vs. Transformers, Massively multilingual machine translation, Transfer learning
Machine translation evaluation and metrics: Lexical-based metrics, Embedding based metrics
Question answering: Reading comprehension, Open retrieval QA
Participatory research: Scaling NLP to several low-resource languages, Role of participatory/grassroots research, NLP democratization
Large language models: Ingredients of modern LLM architectures, Effect of scaling, Quality of multilingual pre-training data, and Language identification at scale
Multilingual LLM evaluations: Classical NLP tasks, Text generation tasks, Reasoning tasks, Knowledge-intensive tasks
Post-training: Post-training methods, Role of synthetic data
Multimodal and multicultural LLMs: VLMs, Making VLMs multicultural
Multilingual safety: Jailbreaking LLMs, Cultural bias of LLMs
Multilingual speech representations: Transformer for speech, Speech evaluations
Automatic speech recognition and translation: Automatic speech recognition, Speech-to-text translation, Simultaneous speech-to-speech translation
Text-to-speech: Single-speaker vs. Multi-speaker TTS, AudioLLM for TTS
Reading List