ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models

(ME-FoMo)

Date : Saturday, May 11th, 2024

Location: Room Strauss 2 @ Messe Wien Exhibition and Congress Center, Vienna Austria

Accepted papers: https://openreview.net/group?id=ICLR.cc/2024/Workshop/ME-FoMo

ICLR workshop site: https://iclr.cc/virtual/2024/workshop/20589 (Zoom link available here)

Update April 21, 2024: Schedule is available here!

Foundation models (FMs) have revolutionized machine learning research across domains. These models are trained on extensive, highly varied datasets and can be quickly adapted to solve many tasks of interest. FMs are extremely effective on language (e.g., GPT-3 , BERT, PaLM, LLaMa ), vision (e.g., SimCLR), speech (e.g., Whisper), and multi-modal (e.g., CLIP, DALL-E) inputs.

However, understanding of FMs lags far behind their extraordinary performance. FMs are known for their surprising emergent capabilities, such as in-context learning , but rigorous characterization of such phenomena is sorely lacking. Recently, substantially smaller models (e.g., LLaMA) have demonstrated performance comparable to or better than huge FMs from the previous generation (e.g, OPT). These findings suggest that careful selection of data, training objectives, and adaptation methods can more effectively induce desirable properties in FMs. Development of such techniques can be accelerated through better understanding.

This workshop aims to bring together researchers who work on developing an understanding of FMs, through either careful experimentation or theoretical work. Rigorous characterization of FMs can also contribute to the broader goal of mitigating undesirable behaviors. FMs are now broadly available to users, so misaligned models present real-world risk. We thus also welcome submissions of previously unpublished works that investigate how to better characterize biases in models and align them.

Topics of the workshop

The workshop will focus on three main aspects of FMs: pretraining, adaptation, and emergent capabilities. These components may include, but are not limited to, the following topics.

Pre-Training: How do FMs learn useful representations? Supervised downstream tasks (e.g., solving math word problems) are often markedly different from the self-supervised pre-training objective. When and how does pre-training improve performance on a diverse set of downstream tasks? Possible sub-topics include:
- Understanding the data
  - How does the quality of the dataset impact the power of the learned representation?
  - Fundamental scaling and limits: how much data do we need? Given a fixed compute budget, is it better to increase the model size or the dataset size?
  - What subsets of the data are most important for the performance and capabilities of foundation models?
- Loss Functions
  - Vision: contrastive vs. generative vs. masked autoencoding
  - Language: masked language modeling, autoregressive modeling, auxiliary objectives; tokenization methods
  - Multi-modal: contrastive objectives, translation-driven objectives
- Model Architecture
  - Effect of model scale
  - Attention vs recurrence (e.g., structured state-space models)
  - Nonparametric or semi-parametric models: retrieval-augmented models
  - Diffusion models vs autoregressive models
  - Mixture-of-experts
- Generalization, transfer, and representation learning
  - Role of optimization on representation learning and transfer
  - Analyzing learned representations
  - Theory in simplified models
  - Training dynamics and hyperparameters at scale
Adaptation: How can we quickly adapt FMs? FMs are trained using unlabelled data with general-purpose objectives, so how can we effectively adapt them to meaningful downstream use cases? Possible subtopics include:
- Fine-tuning, prompting, in-context learning
  - How does fine-tuning modify the pre-trained representation?
  - Representation-based: Multimodal representation learners admit straightforward adaptation to downstream tasks through direct manipulation of the representation space (e.g., DINO). How and when does this work?
  - Investigations into different prompting and decoding methods
  - Which examples should be inserted during in-context learning?
- Instruction Tuning
  - What does instruction tuning do to the base model? How do models learn to generalize in this setting?
  - How can instruction tuning be made more effective?
- Model Un-Learning and Watermarking
  - Given data copyright concerns, there is growing interest in ensuring that a model can “un-learn” (i.e., forget) a datapoint it was pre-trained on. What are effective methods for this?
  - Watermarking outputs can ensure that model generations are identifiable. What types of watermarks are effective while preserving quality?
- Safety and Alignment
  - Pre-trained language models are often fine-tuned to align with human preferences. How does an aligned model differ from the base model?
  - How does reinforcement learning from human feedback (RLHF) work? In what cases can supervised fine-tuning achieve the same goals?
  - What are the safety deficiencies of current FMs? How can we effectively understand the internal works of FMs in order to better align them?
- Robustness, Calibration, and Biases
  - In what cases do FMs generalize to out-of-distribution examples? Why? How can we encourage this behavior?
  - What kinds of biases are accumulated in FMs during pre-training? How can we later remove or mitigate these biases?
- Efficient methods
  - Fine-tuning often modifies a small subspace of the model parameters. Do we really need scale during fine-tuning? Can fine-tuning be made more efficient?
  - Task-aware pruning and distillation methods may yield smaller, more efficient models that preserve downstream performance. How do these methods work? Can we make them more effective?
Emergent phenomena: Scale appears to drive qualitatively different behavior in models (e.g., in-context learning, reasoning, chain-of-thought) that can emerge suddenly during training (e.g., grokking). We lack a rigorous understanding of what increasing the scale does to the training procedure and how these desirable emergent capabilities come about. Possible subtopics include:
- Scale-driven capabilities
  - Chain of Thought, reasoning, in-context learning capabilities
  - Improved robustness and calibration
  - Improved characterization of emergent capabilities
- Scaling laws
  - How and why does performance scale with data, compute, and model size?
  - Grokking: how do new capabilities suddenly emerge during FM training?

Please see the Call for Papers for important dates, links, and policies.

Papers will be submitted on OpenReview: https://openreview.net/group?id=ICLR.cc/2024/Workshop/ME-FoMo

Timeline for paper reviewing and acceptance: We expect the paper submission deadline to be Feb 3rd, 2024, with a review period between Feb 10th to Feb 24th. We will release all decisions to the authors by the global notification deadline of March 3, 2024.

Speakers

Hannaneh Hajishirzi

University of Washington, AI2

Alexander "Sasha" Rush

Cornell University and HuggingFace

Amir Globerson

Tel Aviv University

Yuandong Tian