Multilingual Representation Learning
Topics in CS (Applications): COMP 598-002
Instructor: David I. Adelani, Assistant Professor, McGill University & Mila
Contact: david.adelani@mcgill.ca
Term: Winter 2025
When: Tuesdays & Thursdays, 10:05am-11:25am
Where: McConnell Engineering Building 103
Course Description
This seminar-style course focuses on advances in multilingual representation learning and on scaling language technologies to many of the world's languages, including high-resource languages (e.g., English and French), mid-resource languages (e.g., Indonesian and Swahili), and low-resource languages (e.g., Wolof and Quechua), as well as some multimodal applications to images and speech. In the first four lectures, I will provide an overview of multilingual NLP, text embedding models, cross-lingual transfer learning, and open problems in NLP.
Prerequisites:
One of the following McGill courses: Natural Language Processing (COMP 550), Natural Language Understanding with Deep Learning (COMP 545), Applied Machine Learning (COMP 551), or a relevant NLP course at another university. If you are unsure, email me.
Guest lectures:
Julia Kreutzer (Cohere for AI)
Colin Cherry (Google Translate)
Min Ma (Google DeepMind)
Grading (tentative):
This is a demanding course in terms of participation and projects. All deadlines start after the Add/Drop deadline (Tuesday, January 14, 2025).
● Reading and reviewing papers (20%): You are expected to submit a technical review (conference-style) for one of the papers prior to each class on MyCourses. You will submit 8 such reviews.
● Presenting papers in class (20%): After the first few lectures, each student will form a group of two for a joint presentation, on a topic different from the one they wrote a technical review about.
● Leading paper discussions in class (10%): Sign up as a panelist at least twice to critically analyze the presented papers.
● In-class project proposal presentation (5%): To get quick feedback on your proposed project.
● Project (40%): You will complete a project in groups of two. This involves:
○ Literature review (10%)
○ Baselines (5%)
○ Final paper with new experiments and code submission (15%)
○ Final presentation at the end of term (10%)
● Class participation (5%): How engaged you are in the lectures (asking questions during lectures, coming to office hours, etc.).
Topics of interest
Text embeddings: Word embeddings, Sentence embeddings, Text retrieval, Text embedding benchmarks
Cross-lingual transfer: Multilingual transformer models, Transfer learning, Parameter-efficient fine-tuning, Tokenization issues
Bitext mining: Large scale parallel data for machine translation, Contrastive Learning
Neural machine translation: LSTMs vs. Transformers, Massively multilingual machine translation, Transfer learning
Machine translation evaluation and metrics: Lexical-based metrics, Embedding based metrics
Question answering: Reading comprehension, Open retrieval QA
Participatory research: Scaling NLP to several low-resource languages, Role of participatory/grassroots research, NLP democratization
Large language models: Ingredients of modern LLM architectures, Effect of scaling, Quality of multilingual pre-training data, and Language identification at scale
Multilingual LLM evaluations: Classical NLP tasks, Text generation tasks, Reasoning tasks, Knowledge-intensive tasks
Post-training: Post-training methods, Role of synthetic data
Multimodal and multicultural LLMs: VLMs, Making VLMs multicultural
Multilingual safety: Jailbreaking LLMs, Cultural bias of LLMs
Multilingual speech representations: Transformer for speech, Speech evaluations
Automatic speech recognition and translation: Automatic speech recognition, Speech-to-text translation, Simultaneous speech-to-speech translation
Text-to-speech: Single-speaker vs. Multi-speaker TTS, AudioLLM for TTS
Reading List