Scaling Laws and Foundation Models
IFT 6760B & 6167 Winter 2022, Université de Montréal / Mila - Quebec AI Institute
Here is a suggested list of topics and papers - still UNDER CONSTRUCTION.
If you would like to suggest a relevant paper not on the list, please contact the instructor and/or the TAs (contact info is on the course description page). Here is the paper presentation schedule & sign-up sheet.
Other Relevant Courses
2022 AI Safety Fundamentals course at Cambridge
Jacob Hilton's deep learning curriculum: https://github.com/jacobhilton/deep_learning_curriculum
Alternative points of view and criticism of large-scale models
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
Colin Raffel: talks
A few possibly controversial opinions about large language models at Carnegie Mellon University Language Technologies Topical Seminar, 2021.
The Sweet Lesson at SustaiNLP Workshop, 2021.
What do language models learn from language modeling? at Stanford University CS 330 Lecture, 2021.
How and why should(n't) we scale machine learning? at IBM AI Hardware Forum Keynote, 2021.
A better way to get language models to do what you ask at AKBC 2021 Unstructured and Structured Knowledge Bases Workshop and Cohere.ai, 2021.
Scaling up Models and Data at CIFAR Deep Learning and Reinforcement Learning Summer School, Nepal Winter School in AI, and Advanced Language Processing Winter School, 2021.
Videos: Talks, Tutorials, Demos
GPT-3: Language Models are Few-Shot Learners (Paper Explained) - by Yannic Kilcher
Recent Large-Scale Pretrained Models (a.k.a. Foundation Models)
On The Opportunities and Risks of Foundation Models
A report by Stanford's Center for Research on Foundation Models (CRFM)
AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing
CLIP paper: Learning Transferable Visual Models From Natural Language Supervision (OpenAI blog) (see the zero-shot sketch after this list)
DALL-E paper: Zero-Shot Text-to-Image Generation (OpenAI blog)
Training language models to follow instructions with human feedback
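Since CLIP-style zero-shot classification comes up in several papers above, here is a minimal sketch of the mechanism: embed the image and a set of text prompts into a shared space, then pick the most similar prompt. The random vectors below are hypothetical stand-ins for CLIP's image and text encoders, not OpenAI's actual model or API.

import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)                     # stand-in for an image-encoder output
prompts = ["a photo of a cat", "a photo of a dog"]
text_embs = [rng.normal(size=512) for _ in prompts]  # stand-ins for text-encoder outputs

# Zero-shot prediction = the prompt whose embedding is closest to the image embedding.
scores = [cosine_sim(image_emb, t) for t in text_embs]
print("predicted:", prompts[int(np.argmax(scores))])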
Foundation Models, Scaling and Reinforcement Learning
Can Wikipedia Help Offline Reinforcement Learning?
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
Training language models to follow instructions with human feedback
Offline Pre-trained Multi-Agent Decision Transformer
Fine-Tuning Language Models from Human Preferences
AI, Philosophy & Ethics
Scaling Laws in Natural and Artificial Systems
Broken Power Laws (the generic functional form is sketched after these examples):
astrophysics: https://www.aanda.org/articles/aa/olm/2011/02/aa15581-10/aa15581-10.html
materials science: https://tel.archives-ouvertes.fr/tel-01037944/document
socio-economics: https://www.sciencedirect.com/science/article/abs/pii/S0378437119317935
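For reference (our notation, not taken from any single paper above), a two-segment broken power law takes the form
f(x) = A x^{-\alpha_1} for x \le x_b, and
f(x) = A x_b^{\alpha_2 - \alpha_1} x^{-\alpha_2} for x > x_b,
where the prefactor of the second branch is chosen so that the two segments meet continuously at the break point x_b; the papers above study this kind of regime change in different domains.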
Scale invariance in natural and artificial collective systems: a review
New studies reveal how autism might alter synapse formation, pruning (“It’s like there’s this universal scaling law of cortical maturation.”): https://www.spectrumnews.org/news/new-studies-reveal-how-autism-might-alter-synapse-formation-pruning/
More on criticality:
Beggs, J. M. (2008). The criticality hypothesis: how local cortical networks might optimize information processing. Philos. Trans. A Math. Phys. Eng. Sci. 366, 329–343. doi: 10.1098/rsta.2007.2092
Shew, W. L., and Plenz, D. (2013). The functional benefits of criticality in the cortex. Neuroscientist 19, 88–100. doi: 10.1177/1073858412445487
Shew, W. L., Yang, H., Petermann, T., Roy, R., and Plenz, D. (2009). Neuronal avalanches imply maximum dynamic range in cortical networks at criticality. J. Neurosci. 29, 15595–15600. doi: 10.1523/JNEUROSCI.3864-09.2009
Shew, W. L., Yang, H., Yu, S., Roy, R., and Plenz, D. (2011). Information capacity and transmission are maximized in balanced cortical networks with neuronal avalanches. J. Neurosci. 31, 55–63. doi: 10.1523/JNEUROSCI.4637-10.2011
MultiModal Transformers
The Illustrated Transformer (tutorial; see the attention sketch after this list): https://jalammar.github.io/illustrated-transformer/
A Survey on Visual Transformer
Data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
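As background for the attention-based models listed in this section, here is a minimal NumPy sketch of scaled dot-product attention, the core operation walked through in the Illustrated Transformer (a toy illustration, not any specific paper's implementation):

import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))  # three (4 tokens x 8 dims) matrices
print(attention(Q, K, V).shape)       # (4, 8)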
Time-series Transformers: a Survey
Multivariate Time Series Forecasting with Latent Graph Inference
SCINet (https://arxiv.org/pdf/2106.09305.pdf)
ETSformer (https://arxiv.org/abs/2202.01381)
Pyraformer (https://openreview.net/pdf?id=0EXmFzUn5I)
Informer (https://arxiv.org/abs/2012.07436)
Reformer (https://arxiv.org/pdf/2001.04451.pdf)
N-HiTS (https://arxiv.org/pdf/2201.12886.pdf)
Autoformer (https://arxiv.org/pdf/2106.13008.pdf)
LogTrans (https://arxiv.org/pdf/1907.00235.pdf)
GLR local global ts representations (https://arxiv.org/pdf/2202.02262.pdf)
TACTiS (https://arxiv.org/pdf/2202.03528.pdf)
MQTransformer (https://arxiv.org/pdf/2009.14799.pdf)
ProTran (https://proceedings.neurips.cc/paper/2021/file/c68bd9055776bf38d8fc43c0ed283Paper.pdf)
Preformer (https://arxiv.org/pdf/2202.11356.pdf)
Spacetimeformer (https://arxiv.org/pdf/2109.12218.pdf)
Neural Scaling Laws
Beyond neural scaling laws: beating power law scaling via data pruning
A Neural Scaling Law from the Dimension of the Data Manifold
A constructive prediction of the generalization error across scales
Jonathan Rosenfeld's PhD thesis on Scaling Laws for Deep Learning
Deconstructing Distributions: A Pointwise Framework of Learning
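The papers above typically summarize empirical results with a saturating power law such as L(N) = a N^{-\alpha} + c (loss vs. scale N, with c the irreducible loss). As a toy sketch of how such a curve is fit in practice (synthetic data and parameter values are ours, not taken from any paper above):

import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, alpha, c):
    # Saturating power law: L(n) = a * n^(-alpha) + c
    return a * n ** (-alpha) + c

# Synthetic "loss vs. parameter count" points with mild noise (illustrative only).
rng = np.random.default_rng(0)
n = np.logspace(5, 9, 20)
loss = scaling_law(n, 50.0, 0.3, 1.8) + rng.normal(0.0, 0.02, n.size)

# Recover the exponent from the noisy points; bounds keep alpha in a sane range.
(a, alpha, c), _ = curve_fit(scaling_law, n, loss, p0=(10.0, 0.5, 1.0),
                             bounds=([0.0, 0.0, 0.0], [np.inf, 2.0, np.inf]))
print(f"fitted: a={a:.1f}, alpha={alpha:.3f}, c={c:.2f}")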
Generalization (In- and Out-of-Distribution)
Rethinking Bias-Variance Trade-off for Generalization of Neural Networks
Bias and Generalization in Deep Generative Models: An Empirical Study
Generalizing to Unseen Domains: A Survey on Domain Generalization
Towards a Theoretical Framework of Out-of-Distribution Generalization
OoD-Bench: Benchmarking and Understanding Out-of-Distribution Generalization Datasets and Algorithms
More at https://sites.google.com/site/irinarish/ood_generalization (a subset to be selected)
Other possible candidates:
https://arxiv.org/abs/2109.03795
https://arxiv.org/abs/2007.01434
https://arxiv.org/abs/2102.11436
https://arxiv.org/abs/2107.12580
https://arxiv.org/abs/2108.12284?context=cs.AI
http://proceedings.mlr.press/v119/sastry20a.html
http://arxiv.org/abs/2106.03721
Continual- and Meta-Learning
Scaling and Continual Learning
Don’t Stop Learning: Towards Continual Learning for the CLIP Model
Effect of scale on catastrophic forgetting in neural networks
Effects of Model and Prior Learning Scale on Catastrophic Forgetting
Embracing Change: Continual Learning in Deep Neural Networks
Towards Continual Reinforcement Learning: A Review and Perspectives
Book (1st book on the topic): Lifelong Machine Learning
Continual learning: A comparative study on how to defy forgetting in classification tasks
Class-incremental learning: survey and performance evaluation
Never-Ending Learning (tutorial by Tom Mitchell and Partha Talukdar, ICML 2019)
Continual Learning with Deep Architectures (tutorial by Irina Rish and Vincenzo Lomonaco, ICML 2021)
Continual Lifelong Learning in Natural Language Processing: A Survey
Drinking from a Firehose: Continual Learning with Web-scale Natural Language
Pretrained Language Model in Continual Learning: A Comparative Study
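As background for the continual learning papers above: catastrophic forgetting is commonly quantified with metrics along the following lines (a standard convention, summarized here in our notation). Let a_{t,i} denote accuracy on task i after training on task t, with T tasks in total:
ACC = \frac{1}{T} \sum_{i=1}^{T} a_{T,i}   (average final accuracy across all tasks)
F = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( \max_{t < T} a_{t,i} - a_{T,i} \right)   (average forgetting: each earlier task's peak accuracy minus its final accuracy)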