Neural Scaling Seminar Series

EleutherAI  (Wed, 03/02/2022, noon EST, Mila Public Calendar)

Speaker: Stella Biderman @ eleuther.ai

Title: EleutherAI meets Mila

Abstract: An overview of ongoing projects at eleuther.ai, including the recently released open-source 20B-parameter GPT-NeoX (GPU-based) and the earlier 6B-parameter GPT-J (TPU-based).

Video: here (Mila internal)

Aleph Alpha  (Thu, 03/03/2022, noon EST, Mila Public Calendar)

Speakers: Jonas Andrulis (CEO) and Robert Baldock (Senior Researcher) @ aleph-alpha.de

Title: Aleph Alpha meets Mila

Abstract: An overview of ongoing projects at aleph-alpha.com, including recent work on MAGMA, a large-scale multimodal transformer.

Video: here  (Mila internal)

OpenAI  (Fri, 03/25/2022, 10:30 AM EST, Mila Public Calendar)

Speaker: Ilya Sutskever, Co-founder and Chief Scientist @ openai.com

Title: The incredible power of deep learning

Abstract: An overview of OpenAI's recent work on large neural networks and a discussion of some of their future potential. Ilya will give a short introductory talk (~20 min), followed by a longer question-and-discussion session. Streaming link: here. The talk will NOT be recorded.

Mila  (Thu, 03/31/2022, 10:00 AM, Mila Public Calendar)

Speaker: Edward Hu, PhD Student @ mila.quebec

Title:  Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

Abstract: You can’t train GPT-3 on a single GPU, much less tune its hyperparameters (HPs)… or so it seems. I’m here to tell you this is not true: you can tune its HPs on a single GPU even if you can’t train it that way! In this talk, I’ll describe how, in the so-called maximal update parametrization (abbreviated µP), narrow and wide neural networks share the same set of optimal HPs. This lets us tune any large model by just tuning a small version of it; we call this µTransfer. In particular, this allowed us to tune the 6.7-billion-parameter version of GPT-3 using only 7% of its pretraining compute budget, and, with some asterisks, we get performance comparable to the original GPT-3 model with twice the parameter count. This talk is based on https://arxiv.org/abs/2203.03466.
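
For illustration only: a minimal sketch of the µTransfer recipe described above, assuming Microsoft's open-source mup package (github.com/microsoft/mup); the toy MLP, widths, and learning rate are placeholder choices, not the GPT-3 configuration from the talk.

```python
# Minimal sketch of the muTransfer recipe, assuming the open-source `mup` package.
# Model widths, the MLP itself, and the learning rate are illustrative placeholders.
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

def make_mlp(width: int) -> nn.Sequential:
    # The output layer is MuReadout so its initialization/scaling follows muP.
    return nn.Sequential(
        nn.Linear(784, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        MuReadout(width, 10),
    )

# Declare how the model scales: base and delta models differ only in the
# dimension being scaled (here, the hidden width).
base_model, delta_model = make_mlp(64), make_mlp(128)

# 1) Tune hyperparameters on a cheap, narrow proxy model...
proxy = make_mlp(256)
set_base_shapes(proxy, base_model, delta=delta_model)
proxy_opt = MuAdam(proxy.parameters(), lr=3e-3)   # lr found by sweeping on the proxy

# 2) ...then reuse the same hyperparameters on the wide target model.
target = make_mlp(4096)
set_base_shapes(target, base_model, delta=delta_model)
target_opt = MuAdam(target.parameters(), lr=3e-3)  # under muP, the optimum transfers
```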

Video: here

Anthropic  (Thu, 04/21/2022, 3:00 PM EST, Mila Public Calendar)

Speaker: Jared Kaplan, Co-founder @ anthropic.com

Title: How to Train a Helpful and Harmless Assistant

Abstract: We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as Python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.

This talk is based on https://arxiv.org/abs/2204.05862.
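
For illustration only: a minimal sketch, on made-up placeholder numbers, of how the roughly linear relation between the RL reward and the square root of the policy-to-initialization KL divergence could be fit from training logs; none of the values below come from the paper.

```python
# Minimal sketch (placeholder data, not numbers from the paper) of checking the
# reported roughly-linear relation: reward ~= a * sqrt(KL(policy || init)) + b.
import numpy as np

# Hypothetical (KL, reward) pairs logged over the course of RLHF training.
kl = np.array([1.0, 4.0, 9.0, 16.0, 25.0])         # KL divergence from the initial policy
reward = np.array([0.10, 0.22, 0.31, 0.39, 0.52])  # preference-model reward

sqrt_kl = np.sqrt(kl)
a, b = np.polyfit(sqrt_kl, reward, deg=1)           # least-squares linear fit
pred = a * sqrt_kl + b
r2 = 1.0 - np.sum((reward - pred) ** 2) / np.sum((reward - reward.mean()) ** 2)
print(f"reward ~= {a:.3f} * sqrt(KL) + {b:.3f}  (R^2 = {r2:.3f})")
```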

Meeting link: https://umontreal.zoom.us/j/3046597454?pwd=REtaVW1NaXdqK3pEMTNrNnB6dnJDdz09

Video: here

LightOn  (Wed, 05/25/2022, 11:00 AM EST, Mila Public Calendar)

Speaker: Julien Launay, Industrial PhD Student @ lighton.ai

Title: What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

Abstract: Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with three model architectures (causal/non-causal decoder-only and encoder-decoder), trained with two different pretraining objectives (autoregressive and masked language modeling), and evaluated with and without multitask prompted finetuning. We train models with over 5 billion parameters for more than 170 billion tokens, thereby increasing the likelihood that our conclusions will transfer to even larger scales. Our experiments show that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely unsupervised pretraining. However, models with non-causal visibility on their input trained with a masked language modeling objective followed by multitask finetuning perform the best among our experiments. We therefore consider the adaptation of pretrained models across architectures and objectives. We find that pretrained non-causal decoder models can be adapted into performant generative causal decoder models, using autoregressive language modeling as a downstream task. Furthermore, we find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models, ultimately achieving competitive performance after multitask finetuning. Code and checkpoints are publicly available.

This talk is based on https://arxiv.org/abs/2204.05832.
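
For illustration only: a minimal sketch of the two attention-visibility patterns the abstract contrasts (a causal decoder mask versus a non-causal "prefix-LM" mask that sees its input prefix bidirectionally); sequence and prefix lengths are arbitrary.

```python
# Minimal sketch of the attention-visibility patterns compared in the paper:
# a causal decoder mask vs. a non-causal ("prefix-LM") mask. Lengths are illustrative.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True = attention allowed; token i may only attend to positions <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Start from the causal mask, then let every token see the whole input prefix,
    # making attention bidirectional over the prefix and causal over the rest.
    mask = causal_mask(seq_len)
    mask[:, :prefix_len] = True
    return mask

print(causal_mask(5).int())                   # lower-triangular visibility (incl. diagonal)
print(prefix_lm_mask(5, prefix_len=2).int())  # first 2 columns fully visible
```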

Meeting link: https://umontreal.zoom.us/j/3046597454?pwd=REtaVW1NaXdqK3pEMTNrNnB6dnJDdz09

Video: here

Microsoft Research  (Wed, 06/01/2022, 3:00 PM EST, Mila Public Calendar)

Speaker: Greg Yang, a researcher @ Microsoft Research in Redmond, Washington

Title: The Unreasonable Effectiveness of Mathematics in Large Scale Deep Learning

Abstract: Recently, the theory of infinite-width neural networks led to the first technology, muTransfer, for tuning enormous neural networks that are too expensive to train more than once. For example, this allowed us to tune the 6.7-billion-parameter version of GPT-3 using only 7% of its pretraining compute budget, and with some asterisks, we get performance comparable to the original GPT-3 model with twice the parameter count. In this talk, I will explain the core insight behind this theory. In fact, this is an instance of what I call the *Optimal Scaling Thesis*, which connects infinite-size limits for general notions of “size” to the optimal design of large models in practice, illustrating a way for theory to reliably guide the future of AI. I’ll end with several concrete key mathematical research questions whose resolutions will have an incredible impact on how practitioners scale up their NNs.

Meeting link: https://umontreal.zoom.us/j/3046597454?pwd=REtaVW1NaXdqK3pEMTNrNnB6dnJDdz09

Video: TBD 

University of Cambridge  (TBA)

Speaker: David Krueger, University Lecturer @ University of Cambridge

Video: TBD 

IBM  (TBA)

Speaker: Guillermo Cecchi, Principal Research Staff Member @ research.ibm.com

Video: TBD 

LAION  (TBA)

Speakers: Jenia Jitsev (Scientific Lead) and Christoph Schuhmann (Organizational Lead) @ laion.ai

Video: TBD 

Hugging Face  (TBA)

Speaker: Thomas Wolf, Science Team Lead @ huggingface.co

Video: TBD 

AI Sweden  (TBA)

Speakers:

Video: TBD