Predicting Generalization in Deep Learning

Competition at NeurIPS 2020


12/17: The datasets used for all phases of the competition are now available!

12/13: Technical reports submitted by the competitors and the recording of the live event are now available!

11/12: Winners are announced! Congratulations to all the winners!

11/1: The competition has ended.


Generalization is one of the most important topics in machine learning, and the generalization of deep neural networks defies "conventional" wisdom. Numerous bounds have been proposed, but the large majority of them greatly overestimate the observed generalization performance of the models, and comparing them is difficult because their numerical values are often not reflective of how good the bounds actually are. In this competition, we aim to provide a platform for rigorously studying the generalization of deep neural networks, and we invite both practitioners and theoreticians to design the best complexity measures for deep neural networks. We hope that the competition will instigate progress in understanding and explaining generalization of deep neural networks. Details of the competition can be found here. You can find the competition on Codalab here.


In this competition, competitors are asked to write a Python function whose input is a trained neural network and its training data, and whose output is a complexity measure or generalization predictor that quantifies how well the trained model generalizes to test data. The competition is separated into two phases, a development phase and an evaluation phase, each with its own set of neural networks. In the development phase, competitors can submit solutions that are evaluated on private dataset 1, which contains neural network architectures and training data different from those provided to the competitors. Competitors may only submit a fixed number of solutions every day, and each submission must finish within a given time budget. In the evaluation phase, competitors have a limited number of chances to submit new solutions. Solutions in this phase are first run on the development phase data; if they finish within the time budget given in the development phase, they are evaluated on private dataset 2 without any time limit. Submissions will be made through Codalab.


A distinct feature of this competition is that every datum is a neural network trained on some training data. In this first iteration of the competition, we will be focusing on convolutional neural networks for image classification tasks. In addition, we will be focusing on sequential networks, a class of models that have no skip connections and can be expressed as a list of operations (e.g. convolutions and non-linearities). We will not be providing a "training dataset", because to be general, predictors should be sufficiently agnostic to the topologies of the models and the characteristics of the models' training data. Instead, we will be providing 2 collections of neural networks for testing and debugging submissions: one set of VGG-like models trained on CIFAR-10, and one set of Network-in-Network-like models trained on SVHN. We do not expect submissions to have any explicit dependence on these two sets of models, but we do permit parametric models (i.e. some kind of meta-model trained on features extracted from neural networks). Information about the private data used in the development phase and evaluation phase will be kept secret. We will use the TensorFlow 2.0 Keras interface for all models in this competition.
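Because the competition models are sequential, a predictor can traverse them as a flat list of operations. As a minimal sketch (the helper name is our own, not part of the competition API), the following walks a Keras model's `layers` list and counts the trainable parameters in each layer:

```python
# Sketch: per-layer inspection of a sequential Keras model.
# Assumes only the standard Keras conventions that a model exposes
# `model.layers` and each layer exposes `get_weights()` (a list of
# NumPy arrays); `per_layer_param_counts` is a hypothetical helper.
import numpy as np

def per_layer_param_counts(model):
    """Return the number of weight parameters in each layer, in order.

    Since competition models have no skip connections, iterating
    `model.layers` visits every operation in the network exactly once.
    """
    return [int(sum(w.size for w in layer.get_weights()))
            for layer in model.layers]
```

A complexity measure could aggregate such per-layer statistics (norms, ranks, sharpness proxies) into a single score; the point here is only that the sequential structure makes layer-by-layer traversal straightforward.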


Participants in this competition are required to submit a single Python script that contains a function named complexity. The function should take in a Keras model and a Keras dataset object containing the training data of that model, and output a single real-valued scalar. This function will be run on all the models in the private dataset, inside a Docker container on our server. We will then rank the models by predicted complexity and measure how consistent that ranking is with the models' actual generalization performance. Details of the tasks and evaluation metrics can be found in this document.
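To make the required interface concrete, here is a minimal sketch of a submission script. The entry-point name `complexity` and its signature follow the description above; the measure itself (the total squared norm of the weights, a deliberately crude proxy) and the helper name are our own choices for illustration, not a recommended or official baseline:

```python
# Sketch of a competition submission: a file defining `complexity`.
# The norm-based measure and `sum_of_squared_norms` helper are
# hypothetical; only the entry-point signature is prescribed.
import numpy as np

def sum_of_squared_norms(weight_arrays):
    """Sum of squared Frobenius norms over a list of weight arrays."""
    return float(sum(np.sum(np.square(w)) for w in weight_arrays))

def complexity(model, dataset):
    """Required entry point: takes a trained Keras model and its
    training data, returns a single real-valued scalar.

    This toy measure ignores the dataset entirely and scores the
    model by the total squared norm of its weights.
    """
    return sum_of_squared_norms(model.get_weights())
```

Note that `model.get_weights()` returns plain NumPy arrays in TF 2.x Keras, so a measure written this way only needs TensorFlow at the boundary; a real submission would likely also use `dataset` (e.g. to estimate margins or sharpness on the training data).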

Yiding Jiang

Carnegie Mellon University

Pierre Foret


Scott Yak


Behnam Neyshabur


Isabelle Guyon

University Paris-Saclay, France and ChaLearn USA

Hossein Mobahi


Gintare Karolina Dziugaite

Element AI

Daniel Roy

University of Toronto

Suriya Gunasekar

Microsoft Research

Samy Bengio



Twitter: @PGDL_NeurIPS