Understanding and Improving Generalization in Deep Learning

Invited Speakers

Mikhail Belkin

(Ohio State University)

Title: A Hard Look at Generalization and its Theories

Abstract: "A model with zero training error is overfit to the training data and will typically generalize poorly" goes statistical textbook wisdom. Yet in modern practice over-parametrized deep networks with near perfect (interpolating) fit on training data still show excellent test performance. This fact is difficult to reconcile with most modern theories of generalization that rely on bounding the difference between the empirical and expected error. Indeed, as we will discuss, bounds of that type cannot be expected to explain generalization of interpolating models. I will proceed to show how classical and modern models can be unified within a new "double descent" risk curve that extends the usual U-shaped bias-variance trade-off curve beyond the point of interpolation. This curve delimits the regime of applicability of classical bounds and the regime where new analyses are required. I will give examples of first theoretical analyses in that modern regime and discuss the (considerable) gaps in our knowledge. Finally I will briefly discuss some implications for optimization.

Bio: Mikhail Belkin is a Professor in the departments of Computer Science and Engineering and Statistics at the Ohio State University. He received a PhD in mathematics from the University of Chicago in 2003. His research focuses on understanding the fundamental structure in data, the principles of recovering these structures, and their computational, mathematical, and statistical properties. This understanding, in turn, leads to algorithms for dealing with real-world data. His work includes algorithms such as Laplacian Eigenmaps and Manifold Regularization, based on ideas from classical differential geometry, which have been widely used for analyzing non-linear high-dimensional data. He has done work on spectral methods, Gaussian mixture models, kernel methods, and applications. Recently, his work has focused on understanding generalization and optimization in modern over-parametrized machine learning. Prof. Belkin is a recipient of an NSF CAREER Award and a number of best-paper and other awards, and has served on the editorial boards of the Journal of Machine Learning Research and IEEE PAMI.

Chelsea Finn

(UC Berkeley & Google Brain)

Title: Training for Generalization

Abstract: TBA.

Bio: Chelsea Finn is a research scientist at Google Brain, a post-doc at Berkeley AI Research Lab (BAIR), and will join the Stanford Computer Science faculty in Fall 2019. Finn’s research studies how new algorithms can enable machines to acquire intelligent behavior through learning and interaction, allowing them to perform a variety of complex sensorimotor skills in real-world settings. She has developed deep learning algorithms for concurrently learning visual perception and control in robotic manipulation skills, inverse reinforcement learning methods for scalable acquisition of nonlinear reward functions, and meta-learning algorithms that can enable fast, few-shot adaptation in both visual perception and deep reinforcement learning. Finn’s research has been recognized through an NSF graduate fellowship, the C.V. Ramamoorthy Distinguished Research Award, and the Technology Review 35 Under 35 Award, and her work has been covered by various media outlets, including the New York Times, Wired, and Bloomberg. With Sergey Levine and John Schulman, she also designed and taught a course on deep reinforcement learning, with thousands of followers online.

Finn received a PhD in Computer Science from UC Berkeley and an S.B. in Electrical Engineering and Computer Science from MIT.

Sham Kakade

(U Washington)

Title: Prediction, Learning, and Memory

Abstract: Building accurate language models that capture meaningful long-term dependencies is a core challenge in language processing. We consider the problem of predicting the next observation given a sequence of past observations, specifically focusing on the question of how to make accurate predictions that explicitly leverage long-range dependencies. Empirically, and perhaps surprisingly, we show that state-of-the-art language models, including LSTMs and Transformers, do not capture even basic properties of natural language: the entropy rates of their generations drift dramatically upward over time. We also provide provable methods to mitigate this phenomenon: specifically, we present a calibration-based approach to improve an estimated model based on any measurable long-term mismatch between the estimated model and the true underlying generative distribution. More generally, we will also present fundamental information-theoretic and computational limits of sequential prediction with a memory.
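
To make the drift phenomenon concrete, here is a minimal sketch (my own illustration, not material from the talk) of the diagnostic the abstract alludes to: track the entropy of a generator's per-step predictive distribution and compare early and late windows of the generation. The `toy_model` function is a hypothetical stand-in whose predictions flatten as the generated context grows.

```python
# Measure whether the per-step predictive entropy of an autoregressive generator
# drifts upward over the course of generation. `toy_model` is a hypothetical
# stand-in for any language model, built so that drift is visible.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50

def toy_model(context):
    # Hypothetical model whose predictive distribution flattens (gains entropy)
    # as the generated context grows -- the drift phenomenon described above.
    temperature = 1.0 + 0.002 * len(context)
    logits = rng.normal(size=VOCAB) / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def step_entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

context, entropies = [], []
for _ in range(2000):
    p = toy_model(context)
    entropies.append(step_entropy(p))
    context.append(rng.choice(VOCAB, p=p))

print(f"mean per-step entropy, first 200 steps: {np.mean(entropies[:200]):.3f} nats")
print(f"mean per-step entropy, last 200 steps:  {np.mean(entropies[-200:]):.3f} nats")
# An upward gap between these two numbers is the "entropy-rate drift" that a
# calibration step would aim to remove.
```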

Bio: Sham Kakade is a Washington Research Foundation Data Science Chair, with a joint appointment in the Department of Computer Science and the Department of Statistics at the University of Washington. He works on the theoretical foundations of machine learning, focusing on the design of provable, practical, and statistically and computationally efficient algorithms. Amongst his contributions, with a diverse set of collaborators, are: establishing principled approaches in reinforcement learning (including the natural policy gradient, conservative policy iteration, and the PAC-MDP framework); optimal algorithms for stochastic and non-stochastic multi-armed bandit problems (including the widely used linear bandit and Gaussian process bandit models); computationally and statistically efficient tensor decomposition methods for estimating latent variable models (including mixtures of Gaussians, latent Dirichlet allocation, hidden Markov models, and overlapping communities in social networks); and faster algorithms for large-scale convex and nonconvex optimization (including how to escape from saddle points efficiently). He is the recipient of the IBM Goldberg Best Paper Award (2007) for contributions to fast nearest-neighbor search and the INFORMS Revenue Management and Pricing Section Prize for best paper (2014). He served as program chair for COLT 2011.

Sham completed his Ph.D. at the Gatsby Computational Neuroscience Unit at University College London, under the supervision of Peter Dayan, and he was a postdoc at the Dept. of Computer Science, University of Pennsylvania, under the supervision of Michael Kearns. Sham was an undergraduate at Caltech, studying physics under the supervision of John Preskill. Sham has been a Principal Research Scientist at Microsoft Research, New England, an associate professor at the Department of Statistics, Wharton, UPenn, and an assistant professor at the Toyota Technological Institute at Chicago.

Jason Lee

(USC)

Title: On the Foundations of Deep Learning: SGD, Overparametrization, and Generalization

Abstract: Deep learning has had phenomenal empirical successes in many domains, including computer vision, natural language processing, and speech recognition. To consolidate and boost this empirical success, we need to develop a more systematic and deeper understanding of the elusive principles of deep learning. In this talk, I will provide an analysis of several elements of deep learning, including non-convex optimization, overparametrization, and generalization error. First, we show that gradient descent and many other algorithms are guaranteed to converge to a local minimizer of the loss. For several interesting problems, including matrix completion, this guarantees convergence to a global minimum. Then we will show that gradient descent converges to a global minimizer for deep overparametrized networks. Finally, we analyze the generalization error by showing that a subtle combination of SGD, the logistic loss, and the architecture promotes large-margin classifiers, which are guaranteed to have low generalization error. Together, these results show that, on overparametrized deep networks, SGD finds solutions with both low training and test error.
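
The margin-maximization effect mentioned in the abstract can be seen in a toy setting. The sketch below is my own, under simplified assumptions (a linear model, logistic loss, and linearly separable 2-D data rather than a deep network): plain gradient descent keeps increasing the normalized margin long after the training error has reached zero.

```python
# Toy illustration of the implicit bias of gradient descent on the logistic loss:
# on separable data, the normalized margin keeps growing after zero training error.
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)  # separable by the direction (1, 1)

w = np.zeros(2)
lr = 0.1
for step in range(1, 50001):
    margins = y * (X @ w)
    # Gradient of the average logistic loss log(1 + exp(-y * w.x)).
    grad = -(X * (y * (1.0 / (1.0 + np.exp(margins))))[:, None]).mean(axis=0)
    w -= lr * grad
    if step in (100, 1000, 10000, 50000):
        norm_margin = np.min(y * (X @ w)) / np.linalg.norm(w)
        train_err = np.mean(y * (X @ w) <= 0)
        print(f"step={step:6d}  train error={train_err:.2f}  "
              f"normalized margin={norm_margin:.4f}  ||w||={np.linalg.norm(w):.2f}")
```

The weight norm keeps diverging while the direction of the weights stabilizes, which is what drives the normalized margin upward toward the maximum-margin separator.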

Bio: Jason Lee is an assistant professor in Data Sciences and Operations at the University of Southern California. Prior to that, he was a postdoctoral researcher at UC Berkeley working with Michael Jordan. Jason received his PhD from Stanford University, where he was advised by Trevor Hastie and Jonathan Taylor. His research interests are in statistics, machine learning, and optimization. Lately, he has worked on high-dimensional statistical inference, analysis of non-convex optimization algorithms, and theory for deep learning.

Aleksander Mądry

(MIT)

Title: Are All Features Created Equal?

Abstract: TBA.

Bio: Aleksander Mądry is an Associate Professor of Computer Science in the MIT EECS Department and a principal investigator in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He received his PhD from MIT in 2011 and, prior to joining the MIT faculty, he spent some time at Microsoft Research New England and on the faculty of EPFL.

Aleksander’s research interests span algorithms, continuous optimization, the science of deep learning, and understanding machine learning from a robustness perspective.

Daniel Roy

(U Toronto)

Title: Progress on Nonvacuous Generalization Bounds

Abstract: Generalization bounds are one of the main tools available for explaining the performance of learning algorithms. At the same time, most bounds in the literature are loose to an extent that raises the question as to whether these bounds actually have any explanatory power in the nonasymptotic regime of actual machine learning practice. I'll report on progress towards developing bounds and techniques, both statistical and computational, aimed at closing the gap between empirical performance and theoretical understanding.
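
As a rough illustration of what "vacuous" means here, the sketch below (my own example, not the speaker's method) evaluates a simple Occam/Hoeffding-style bound, test error <= training error + sqrt((complexity + ln(1/delta)) / (2n)), with hypothetical numbers: a naive parameter-count complexity pushes the bound far above 1, while a short compressed description of the model yields a nonvacuous value.

```python
# Occam/Hoeffding-style bound over a countable hypothesis class, where
# "complexity" is the description length of the chosen model in nats.
# The numbers below are hypothetical, chosen only to contrast the two regimes.
import math

def occam_bound(train_err, n, complexity_nats, delta=0.05):
    return train_err + math.sqrt((complexity_nats + math.log(1.0 / delta)) / (2.0 * n))

n = 60_000  # e.g., an MNIST-sized training set
# Naive complexity: a few million parameters at 32 bits each -> hopelessly vacuous (>> 1).
print(occam_bound(0.00, n, complexity_nats=3e6 * 32 * math.log(2)))
# A compressed description a few thousand bits long -> a nonvacuous bound (< 1).
print(occam_bound(0.02, n, complexity_nats=5_000 * math.log(2)))
```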

Bio: Daniel Roy is an Assistant Professor in the Department of Statistical Sciences and, by courtesy, Computer Science at the University of Toronto, and a founding faculty member of the Vector Institute for Artificial Intelligence. Daniel is a recent recipient of an Ontario Early Researcher Award and a Google Faculty Research Award. Before joining U of T, Daniel held a Newton International Fellowship from the Royal Academy of Engineering and a Research Fellowship at Emmanuel College, University of Cambridge. Daniel earned his S.B., M.Eng., and Ph.D. from the Massachusetts Institute of Technology; his dissertation on probabilistic programming won an MIT EECS Sprowls Dissertation Award. Daniel's group works on foundations of machine learning and statistics.