ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)
Workshop on May 4th, 2023.
Accepted papers: https://openreview.net/group?id=ICLR.cc/2023/Workshop/ME-FoMo
Foundation models (FMs) are models that are trained on a large and diverse pool of data and can be adapted to a wide range of tasks. Recent examples of FMs include large language models (GPT-3, BERT, PaLM), image representation encoders (SimCLR), and image-text models (CLIP, DALL-E), which have all revolutionized the way models are built in their domains.
The goal of this workshop is to highlight research that aims to improve our understanding of FMs, and to bring together researchers who work in the area. We interpret understanding liberally, ranging from purely empirical papers that highlight interesting phenomena to papers that attempt to explain or provide theoretical foundations for such phenomena, potentially in simplified settings (for example, two-layer neural networks). Examples of relevant topics include, but are not limited to:
Pretraining: How do FMs learn useful representations?
Pretraining / downstream interface: Objectives such as language modeling and contrastive learning differ from the downstream tasks and operate on unlabeled data. Why, then, do the learned representations transfer to downstream tasks?
Role of data: Can we better understand when representations transfer to new domains and modalities, e.g., when there is positive or negative transfer? What is the role of augmentations in self-supervision (e.g., masking in masked language modeling and positive pairs in contrastive learning)? How should we select optimal data for pretraining?
Loss functions: Does the choice of loss function make a difference? What are the tradeoffs between contrastive, generative, and supervised losses in computer vision, masked language modeling and autoregressive modeling in NLP, and cross-modality translation losses in multimodal models? (A standard contrastive objective is sketched after this list.)
Role of architecture: Does architecture affect the learned representations? What are the effects of model scale, attention vs. recurrence, nonparametric components (retrieval models, deep k-nearest neighbors), and diffusion vs. autoregressive generation?
Adaptation: How should we adapt FMs?
Fine-tuning: Carefully fine-tuning small parts of the network can achieve accuracy comparable to (or better than) full fine-tuning, with much less memory (a minimal example is sketched after this list). Is there a principled way to choose which parts of the network to fine-tune? Can we develop more efficient adaptation methods?
Few-shot learning, prompting, in-context learning: Can we better understand prompting, in-context learning, and representation-based few-shot learning (CLIP, DINO), and develop FMs with better few-shot learning capabilities?
Robustness / calibration: Fine-tuning FMs on labeled in-distribution (ID) source data also leads to high accuracy out-of-distribution (OOD), where we do not have labeled data. Why?
Pruning, speed, memory: Foundation models achieve very high accuracy, but their large size makes them slow and memory-hungry. Is scale really necessary, or are there principled ways to distill these models into faster or more memory-efficient models that preserve accuracy?
Biases: Foundation models may pick up biases from their (pre)training data, and some adaptation methods may retain these biases more than others. Can we understand this better?
Emergent phenomena: Can we understand how scale seems to lead to qualitatively different behaviors (e.g., robustness, in-context learning, reasoning, chain-of-thought) that can emerge suddenly (e.g., grokking)?
Capabilities that arise with scale: Can we understand how models develop in-context learning, few-shot reasoning capabilities such as chain-of-thought (CoT), and improved robustness / calibration?
Scaling laws: Can we understand how and why performance scales with data, compute, and model size? Can we understand the emergence of new capabilities and sudden transitions with scale, and predict when new ones will occur? (An illustrative power-law form is sketched after this list.)
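To make the contrastive-loss topic above concrete, here is one standard form of the contrastive (InfoNCE-style) objective used in self-supervised vision and image-text pretraining. The notation (encoder outputs z, positive pair z_i^+, similarity sim, temperature tau) is illustrative rather than tied to any particular model:

```latex
% InfoNCE-style contrastive loss for an anchor z_i with positive z_i^+
% and a set of candidates {z_j}; sim(.,.) is e.g. cosine similarity and
% tau is a temperature hyperparameter.
\mathcal{L}_{\mathrm{contrastive}}
  = -\,\mathbb{E}_i\!\left[
      \log \frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+}) / \tau\big)}
               {\sum_{j} \exp\!\big(\mathrm{sim}(z_i, z_j) / \tau\big)}
    \right]
```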
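For the fine-tuning topic, the snippet below is a minimal PyTorch sketch of one parameter-efficient strategy: freeze the pretrained backbone and train only a small task head. The backbone, layer sizes, and hyperparameters are placeholders for illustration, not a recommendation of any specific method.

```python
# Minimal sketch of parameter-efficient fine-tuning: freeze the pretrained
# backbone and train only a small task head. Model and layer sizes are
# illustrative stand-ins, not any specific foundation model.
import torch
import torch.nn as nn

# Stand-in for a pretrained foundation-model encoder (hypothetical).
backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 256))
head = nn.Linear(256, 10)  # small task-specific head, trained from scratch

# Freeze all backbone parameters; only the head receives gradients.
for p in backbone.parameters():
    p.requires_grad = False

# The optimizer sees only the trainable (head) parameters, saving memory.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

x = torch.randn(8, 512)               # dummy batch of inputs
labels = torch.randint(0, 10, (8,))   # dummy labels
with torch.no_grad():                 # frozen backbone needs no gradient buffers
    features = backbone(x)
loss = nn.functional.cross_entropy(head(features), labels)
loss.backward()
optimizer.step()
```

Because only the head's parameters require gradients, the optimizer state and backward pass are much smaller than in full fine-tuning, which is where the memory savings come from.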
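For the scaling-laws topic, one commonly fitted parametric form (in the style of Kaplan et al., 2020 and Hoffmann et al., 2022) writes the loss as an irreducible term plus power laws in model size N and dataset size D. The constants are left symbolic and the form is shown only as an illustration:

```latex
% Illustrative scaling-law fit: E is the irreducible loss, and
% A, B, alpha, beta are constants fit empirically to training runs.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```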
Please see the Call for Papers for important dates, links, and policies.
Papers will be submitted on OpenReview: https://openreview.net/group?id=ICLR.cc/2023/Workshop/ME-FoMo
Speakers
Princeton University
Google Brain
Princeton University
Anthropic and Johns Hopkins University
Organizers
Stanford University
MILA/McGill University
Stanford University
Google Brain
Carnegie Mellon University
Stanford University
Google Brain
Advisor - Stanford University
Thank you to all our amazing reviewers (below)!