This workshop solicits contributions that bridge the gap between deep learning theory and modern practice, in an effort to build a mathematical theory of machine learning that can both explain and inspire what practitioners do. We welcome new mathematical analyses that narrow this gap, as well as empirical findings that challenge existing theories and offer avenues for future theoretical investigation.
This Year's New Focus. This year's M3L will focus the contributed talks on “Mathematical Models for Deep Learning Phenomena,” where theoretical analyses can provide direct explanations of, and implications for, empirical observations. Empirical deep learning research has identified many surprising phenomena, even in previously well-studied settings. Many of these phenomena pose real challenges to existing optimization and generalization theories, including the failure of uniform convergence, the ubiquitous existence of adversarial examples, generalization degradation in large-batch training, and grokking, among others. Additionally, large foundation models integrate various learning paradigms, typically including unsupervised pretraining, supervised fine-tuning, and RLHF. These models also exhibit “emergent” abilities that appear mysterious in theory, such as in-context and few-shot learning.
We encourage paper submissions that propose mathematical models to capture interesting and important phenomena in deep learning, preferably with predictions and insights that can advance practice.
Full Paper Submission Deadline: Oct 1, 2024 (AoE) (extended from Sept 29, 2024)
Review Period: Oct 2, 2024 - Oct 8, 2024
Accept/Reject Notification Date: Oct 9, 2024
Uploading Camera-Ready Submissions: Dec 1, 2024
Submission link: https://openreview.net/group?id=NeurIPS.cc/2024/Workshop/M3L
Submission Format:
The reviewing process will be double-blind and all submissions must be anonymized. Please do not include author names, author affiliations, acknowledgments, or any other identifying information in your submission. Submissions and reviews will not be made public; only accepted papers will be.
Main Paper Length: All submissions must be in PDF format and are required to use the LaTeX style file. We strongly recommend that submissions be at most 6 pages long, including figures and tables. Papers may exceed the 6-page limit, but reviewers are not required to read beyond the first 6 pages. Papers that contain more than 10 pages will be desk rejected.
Unlimited additional pages are allowed for references and supplementary materials. Please include the references and supplementary materials in the same PDF as the main paper.
Dual Submissions: This workshop is non-archival and will not have official proceedings. Workshop submissions can be submitted to other venues. We welcome ongoing and unpublished work, including papers that are under review at the time of submission. We do not accept submissions that have already been accepted for publication in other venues with archival proceedings. The only exception is that NeurIPS 2024 main conference papers can be submitted concurrently to this workshop, but only in the form of short versions with a strict page limit of 6 pages.
This workshop's main areas of focus include but are not limited to:
Reconciling Optimization Theory with Deep Learning Practice
Convergence analysis beyond the stable regime: How do optimization methods minimize training losses despite large learning rates and large gradient noise? How should we understand the Edge of Stability (EoS) phenomenon? What more realistic assumptions on the loss landscape and gradient noise would enable training algorithms with faster convergence in both theory and practice?
Continuous approximations of training trajectories: Can we gain insight into discrete-time gradient dynamics by approximating them with a continuous counterpart, e.g., gradient flow or an SDE? When is such an approximation valid? (A minimal sketch contrasting the two appears after this group of topics.)
Advanced optimization algorithms: Why does Adam optimize faster than SGD on Transformers? Under what theoretical models can we design advanced optimization methods (e.g., adaptive gradient algorithms, second-order algorithms, distributed training algorithms) that provably work better?
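As a toy illustration of the first two questions above, the following sketch (our own, not part of the call) contrasts gradient descent with its gradient-flow limit on a one-dimensional quadratic L(x) = (λ/2)x². The step-size threshold η < 2/λ is the classical stability condition for curvature λ; all variable names are illustrative.

```python
# Toy illustration (not from the call): on L(x) = 0.5 * lam * x^2, the
# continuous-time gradient flow dx/dt = -lam * x converges for any lam > 0,
# but discrete gradient descent is stable only when eta < 2 / lam --
# the classical threshold underlying Edge-of-Stability discussions.
lam = 10.0                       # curvature (the top Hessian eigenvalue)
grad = lambda x: lam * x         # gradient of L(x) = 0.5 * lam * x^2

def gradient_descent(x0, eta, steps=50):
    x = x0
    for _ in range(steps):
        x -= eta * grad(x)       # one Euler step of the gradient flow
    return x

for eta in [0.05, 0.19, 0.21]:   # 2 / lam = 0.2 is the stability boundary
    print(f"eta={eta:.2f}  |x_50|={abs(gradient_descent(1.0, eta)):.3e}")
# eta = 0.05, 0.19: iterates shrink; eta = 0.21: iterates blow up, so the
# gradient-flow approximation of GD breaks down past eta = 2 / lam.
```

This is of course a caricature of the EoS regime in deep networks, but it pins down the stability threshold the questions above refer to.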
Generalization for Overparametrized Models
Implicit bias: Do gradient-based algorithms implicitly select solutions that generalize well, and if so, how, despite the presence of a rich set of non-generalizing minimizers?
Generalization Measures: What is the relationship between generalization performance and common generalization measures (e.g., sharpness, margin, norm)? Can we prove non-vacuous generalization bounds based on these measures? (A small sketch after this group of topics computes two such measures.)
Roles of Key Components in Algorithm and Architecture: What are the roles of initialization, learning rate warmup and decay, and normalization layers?
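To make the generalization-measures item concrete, here is a minimal sketch (our own, illustrative only) computing two classical measures for a linear classifier: the weight norm ‖w‖ and the normalized margin minᵢ yᵢ⟨w, xᵢ⟩/‖w‖, the quantity margin-based generalization bounds typically depend on. The data and weights are synthetic.

```python
# Illustrative sketch: two classical generalization measures for a linear
# classifier f(x) = <w, x> -- the weight norm ||w|| and the normalized margin
# min_i y_i <w, x_i> / ||w||, which margin-based bounds depend on.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 toy inputs in R^5
w = rng.normal(size=5)          # stand-in for learned weights
y = np.sign(X @ w)              # labels consistent with w, so margins are >= 0

weight_norm = np.linalg.norm(w)
normalized_margin = np.min(y * (X @ w)) / weight_norm
print(f"||w|| = {weight_norm:.3f}, normalized margin = {normalized_margin:.4f}")
```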
Intriguing Phenomena of Foundation Models
Pretraining: What do foundation models learn in pretraining that allows for efficient finetuning? How does the choice of dataset/architecture affect this?
Effect of Data: How does the number of data passes affect training, and can we consolidate the empirical and theoretical understanding? How should the use of data differ during and after pretraining?
Multimodal Representations: How can we learn representations from multimodal data?
Scaling Laws and Emergent Phenomena: How and why does performance scale with data, compute, and model size? What mathematical models should we use to understand emergent abilities such as in-context and few-shot reasoning? (A curve-fitting sketch follows this group of topics.)
Diffusion Models: What do we understand about the success and limitations of diffusion models and score-matching methods?
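To give a concrete sense of the scaling-laws question above, the sketch below (our own, illustrative only) fits a saturating power law L(N) = a·N^(−b) + c, a functional form commonly used in empirical scaling-law studies, to synthetic loss-versus-size data. It assumes SciPy is available; all the constants and data are made up.

```python
# Illustrative sketch: fit L(N) = a * N**(-b) + c (a common scaling-law form)
# to synthetic loss-vs-model-size data. All numbers here are fabricated.
import numpy as np
from scipy.optimize import curve_fit

def power_law(N, a, b, c):
    return a * N ** (-b) + c

rng = np.random.default_rng(0)
N = np.logspace(6, 10, 20)                     # model sizes (parameters)
loss = power_law(N, a=50.0, b=0.3, c=1.7)      # assumed "true" law
loss *= 1 + 0.01 * rng.normal(size=N.size)     # 1% multiplicative noise

(a, b, c), _ = curve_fit(power_law, N, loss, p0=[1.0, 0.5, 1.0])
print(f"fitted: L(N) ~ {a:.1f} * N^(-{b:.2f}) + {c:.2f}")
```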
Provable Guarantees Beyond Supervised Learning Settings
Online Learning and Reinforcement Learning: How is learning affected by factors such as the quality of expert feedback or data coverage? How should theoretical tools be adapted to inform modern use cases such as RLHF?
Representation Learning and Transfer Learning: What properties of the source and target tasks allow for efficient transfer learning? What types of representations can be learned via self-supervised learning (e.g., contrastive learning)?
Multitask and Continual Learning: What conditions are needed to adapt a model to new tasks while preserving performance on old tasks? What view should we take to understand modern notions of multitask and continual learning, where assumptions can deviate greatly from classical theory?