ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models
(ME-FoMo)
Date : Saturday, May 11th, 2024
Location: Vienna, Austria
Accepted papers: https://openreview.net/group?id=ICLR.cc/2024/Workshop/ME-FoMo
Update April 21, 2024: Schedule is available here!
Foundation models (FMs) have revolutionized machine learning research across domains. These models are trained on extensive, highly varied datasets and can be quickly adapted to solve many tasks of interest. FMs are extremely effective on language (e.g., GPT-3 , BERT, PaLM, LLaMa ), vision (e.g., SimCLR), speech (e.g., Whisper), and multi-modal (e.g., CLIP, DALL-E) inputs.
However, understanding of FMs lags far behind their extraordinary performance. FMs are known for their surprising emergent capabilities, such as in-context learning , but rigorous characterization of such phenomena is sorely lacking. Recently, substantially smaller models (e.g., LLaMA) have demonstrated performance comparable to or better than huge FMs from the previous generation (e.g, OPT). These findings suggest that careful selection of data, training objectives, and adaptation methods can more effectively induce desirable properties in FMs. Development of such techniques can be accelerated through better understanding.
This workshop aims to bring together researchers who work on developing an understanding of FMs, through either careful experimentation or theoretical work. Rigorous characterization of FMs can also contribute to the broader goal of mitigating undesirable behaviors. FMs are now broadly available to users, so misaligned models present real-world risk. We thus also welcome submissions of previously unpublished works that investigate how to better characterize biases in models and align them.
Topics of the workshop
The workshop will focus on three main aspects of FMs: pretraining, adaptation, and emergent capabilities. These components may include, but are not limited to, the following topics.
Pre-Training: How do FMs learn useful representations? Supervised downstream tasks (e.g., solving math word problems) are often markedly different from the self-supervised pre-training objective. When and how does pre-training improve performance on a diverse set of downstream tasks? Possible sub-topics include:
Understanding the data
How does the quality of the dataset impact the power of the learned representation?
Fundamental scaling and limits: how much data do we need? Given a fixed compute budget, is it better to increase the model size or the dataset size?
What subsets of the data are most important for the performance and capabilities of foundation models?
Loss Functions
Vision: contrastive vs. generative vs. masked autoencoding
Language: masked language modeling, autoregressive modeling, auxiliary objectives; tokenization methods
Multi-modal: contrastive objectives, translation-driven objectives
Model Architecture
Effect of model scale
Attention vs recurrence (e.g., structured state-space models)
Nonparametric or semi-parametric models: retrieval-augmented models
Diffusion models vs autoregressive models
Mixture-of-experts
Generalization, transfer, and representation learning
Role of optimization on representation learning and transfer
Analyzing learned representations
Theory in simplified models
Training dynamics and hyperparameters at scale
Adaptation: How can we quickly adapt FMs? FMs are trained using unlabelled data with general-purpose objectives, so how can we effectively adapt them to meaningful downstream use cases? Possible subtopics include:
Fine-tuning, prompting, in-context learning
How does fine-tuning modify the pre-trained representation?
Representation-based: Multimodal representation learners admit straightforward adaptation to downstream tasks through direct manipulation of the representation space (e.g., DINO). How and when does this work?
Investigations into different prompting and decoding methods
Which examples should be inserted during in-context learning?
Instruction Tuning
What does instruction tuning do to the base model? How do models learn to generalize in this setting?
How can instruction tuning be made more effective?
Model Un-Learning and Watermarking
Given data copyright concerns, there is growing interest in ensuring that a model can “un-learn” (i.e., forget) a datapoint it was pre-trained on. What are effective methods for this?
Watermarking outputs can ensure that model generations are identifiable. What types of watermarks are effective while preserving quality?
Safety and Alignment
Pre-trained language models are often fine-tuned to align with human preferences. How does an aligned model differ from the base model?
How does reinforcement learning from human feedback (RLHF) work? In what cases can supervised fine-tuning achieve the same goals?
What are the safety deficiencies of current FMs? How can we effectively understand the internal works of FMs in order to better align them?
Robustness, Calibration, and Biases
In what cases do FMs generalize to out-of-distribution examples? Why? How can we encourage this behavior?
What kinds of biases are accumulated in FMs during pre-training? How can we later remove or mitigate these biases?
Efficient methods
Fine-tuning often modifies a small subspace of the model parameters. Do we really need scale during fine-tuning? Can fine-tuning be made more efficient?
Task-aware pruning and distillation methods may yield smaller, more efficient models that preserve downstream performance. How do these methods work? Can we make them more effective?
Emergent phenomena: Scale appears to drive qualitatively different behavior in models (e.g., in-context learning, reasoning, chain-of-thought) that can emerge suddenly during training (e.g., grokking). We lack a rigorous understanding of what increasing the scale does to the training procedure and how these desirable emergent capabilities come about. Possible subtopics include:
Scale-driven capabilities
Chain of Thought, reasoning, in-context learning capabilities
Improved robustness and calibration
Improved characterization of emergent capabilities
Scaling laws
How and why does performance scale with data, compute, and model size?
Grokking: how do new capabilities suddenly emerge during FM training?
Please see the Call for Papers for important dates, links, and policies.
Papers will be submitted on OpenReview: https://openreview.net/group?id=ICLR.cc/2024/Workshop/ME-FoMo
Timeline for paper reviewing and acceptance: We expect the paper submission deadline to be Feb 3rd, 2024, with a review period between Feb 10th to Feb 24th. We will release all decisions to the authors by the global notification deadline of March 3, 2024.
Speakers
Organizers
Stanford University
University of Washington
Princeton University
Carnegie Mellon University
OpenAI
Carnegie Mellon University
Stanford University
Stanford University