Spring 2019

Organizer for Spring 2019: Luis Rademacher (Email: lrademac@ucdavis.edu)

Apr 2: Michał Dereziński (UC Berkeley). Unbiased estimates for linear regression via volume sampling.

We study the following basic machine learning task: Given a fixed set of n input points in a d-dimensional linear regression problem, we wish to predict a hidden response value for each of the points. We can only afford to obtain the responses for a small subset of the points, which are then used to construct linear predictions for all points in the dataset. The performance of the predictions is evaluated by the total square loss on all responses. We show that a good approximate solution to this least squares problem can be obtained from just dimension d many responses by using a joint sampling technique called volume sampling. Moreover, the least squares solution obtained for the volume sampled subproblem is an unbiased estimator of the optimal solution based on all n responses. This unbiasedness is a desirable property that is not shared by standard subset selection techniques.

Motivated by these basic properties, we develop a theoretical framework for studying volume sampling, which leads to a number of new expectation formulas and statistical guarantees that are of importance not only to least squares regression but also to numerical linear algebra in general. Our methods lead to several extensions of volume sampling, including a regularized variant, and we propose the first efficient algorithms which make this technique a practical tool in the machine learning toolbox.
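
To make the subsampling scheme concrete, here is a minimal sketch (our own toy illustration, using brute-force enumeration over all size-d subsets, not the speaker's efficient algorithms): it draws a subset S with probability proportional to det(X_S)^2 and fits the regression on only those d responses. Averaging the resulting estimator over many draws would approximate the full least-squares solution, reflecting the unbiasedness property described above.

```python
import numpy as np
from itertools import combinations

def volume_sample(X, rng):
    """Brute-force size-d volume sampling: P(S) proportional to det(X_S)^2 (small n only)."""
    n, d = X.shape
    subsets = list(combinations(range(n), d))
    weights = np.array([np.linalg.det(X[list(S)]) ** 2 for S in subsets])
    return list(subsets[rng.choice(len(subsets), p=weights / weights.sum())])

rng = np.random.default_rng(0)
n, d = 12, 3
X = rng.standard_normal((n, d))                                     # toy inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)   # toy responses

S = volume_sample(X, rng)
w_sampled = np.linalg.solve(X[S], y[S])          # fit using only the d sampled responses
w_full = np.linalg.lstsq(X, y, rcond=None)[0]    # full least-squares solution for comparison
print(S, w_sampled, w_full)
```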

Apr 9: Levent Tunçel (U Waterloo). Total Dual Integrality for Convex, Semidefinite and Extended Formulations.

Within the context of characterizations of exactness of convex relaxations of 0,1 integer programming problems, we present a notion of total dual integrality for Semidefinite Optimization Problems (SDPs). Our notion generalizes the existing notion for Linear Programming problems (LPs), by relying on an “integrality constraint” for SDPs that is primal-dual symmetric. A key ingredient for the theory is a generalization to compact convex sets of a theorem of Hoffman for polytopes, fundamental for generalizing the polyhedral notion of total dual integrality introduced by Edmonds and Giles in the 1970s. We study the corresponding theory applied to SDP formulations for stable sets in graphs using the Lovász theta function and show that total dual integrality in this case corresponds to the underlying graph being perfect. We also relate dual integrality of an SDP formulation for the maximum cut problem to bipartite graphs. Total dual integrality for extended formulations naturally comes into play in this context.

Based on joint work with Marcel de Carli Silva.
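
For reference, one standard SDP formulation of the Lovász theta function of a graph G = (V, E), written in our own notation (the talk's total dual integrality analysis applies to formulations of this kind), is

\[
\vartheta(G) \;=\; \max\Big\{ \langle J, X \rangle \;:\; \operatorname{tr}(X) = 1,\ \ X_{ij} = 0 \ \text{for all } \{i,j\} \in E,\ \ X \succeq 0 \Big\},
\]

where J is the all-ones matrix and X ranges over symmetric |V| x |V| matrices.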

Apr 16: Anna Seigal (UC Berkeley) (in room MSB 2112). Learning Paths from Signature Tensors.

Matrix congruence can be extended to the setting of tensors. I describe how to apply methods from tensor decomposition, algebraic geometry and numerical optimization to this group action. Given a tensor in the orbit of another tensor, we can compute a matrix which transforms one to the other. The application is an inverse problem from stochastic analysis: the recovery of paths from their signature tensors of order three. I give identifiability results and recovery algorithms for piecewise linear paths, polynomial paths, and generic dictionaries. This is based on joint work with Max Pfeffer and Bernd Sturmfels.
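
As a small self-contained illustration (our own sketch, not the speaker's code), the snippet below computes the order-1 through order-3 signature tensors of a piecewise linear path by combining single-segment signatures with Chen's identity; the order-3 tensor is the object that the recovery results above take as input.

```python
import numpy as np

def segment_signature(v):
    """Levels 1-3 of the signature of a single linear segment with increment v."""
    s1 = v
    s2 = np.multiply.outer(v, v) / 2.0   # v (x) v / 2!
    s3 = np.multiply.outer(s2, v) / 3.0  # v (x) v (x) v / 3!
    return s1, s2, s3

def chen(a, b):
    """Chen's identity: the signature of a concatenation from the two pieces' signatures."""
    a1, a2, a3 = a
    b1, b2, b3 = b
    return (a1 + b1,
            a2 + np.multiply.outer(a1, b1) + b2,
            a3 + np.multiply.outer(a2, b1) + np.multiply.outer(a1, b2) + b3)

def path_signature(points):
    """Signature tensors (orders 1-3) of the piecewise linear path through the rows of `points`."""
    d = points.shape[1]
    sig = (np.zeros(d), np.zeros((d, d)), np.zeros((d, d, d)))  # identity element
    for p, q in zip(points[:-1], points[1:]):
        sig = chen(sig, segment_signature(q - p))
    return sig

# Example: a two-segment path in the plane; sig[2] is its 2x2x2 third-order signature tensor.
sig = path_signature(np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]))
```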

Apr 23: Roman Vershynin (Math-Stat Colloquium)

Apr 30: Juan Pablo Vielma (MIT). Modeling Power of Mixed Integer Convex Optimization Problems And Their Effective Solution with Julia and JuMP.

More than 50 years of development have made mixed integer linear programming (MILP) an extremely successful tool. MILP’s modeling flexibility allows it to describe a wide range of problems in business, engineering, science and statistics. Furthermore, while MILP is NP-hard, many of these problems are routinely solved in practice thanks to state-of-the-art solvers that nearly double their machine-independent speeds every year. Inspired by this success, the last decade has seen a surge of activity on the solution and application of mixed integer convex programming (MICP), which extends MILP’s versatility by allowing the use of convex constraints in addition to linear inequalities. In this talk we cover various recent developments concerning theory, algorithms and computation for MICP. In particular, given that solvers for MICP can be significantly more effective than those for more general non-convex optimization, one question we consider is that of "MICP representability": what classes of non-convex constraints can be modeled through MICP?

With regard to MICP representability, we start by giving the first precise characterization of MICP representability with bounded integer variables. We then focus on general MICP representability with unbounded integer variables by presenting simple impediments to general MICP representability, considering regularity properties of general MICP representable sets, and discussing the role of irrational unbounded directions in this regularity. Finally, we cover various topics concerning the modeling and computational solution of MICP problems using the Julia programming language and the JuMP modeling language for optimization. In particular, we show how mixed integer optimal control problems where the variables are polynomials can be easily modeled and solved by seamlessly combining several Julia packages and JuMP extensions with the Julia-written MICP solver Pajarito.
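
To make the representability question concrete, here is a hedged sketch of one classic MICP-representable non-convex constraint: a cardinality ("at most k nonzeros") restriction encoded with binary variables and a big-M bound. It is written in Python with CVXPY rather than JuMP, uses made-up data, and assumes a mixed-integer-capable solver (e.g., SCIP, GUROBI, or ECOS_BB) is installed.

```python
import numpy as np
import cvxpy as cp

# Hypothetical data for a cardinality-constrained least-squares fit.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 8)), rng.standard_normal(20)
k, M = 3, 10.0                        # sparsity budget and an assumed bound |x_i| <= M

x = cp.Variable(8)
z = cp.Variable(8, boolean=True)      # z_i = 1 allows x_i to be nonzero
constraints = [cp.abs(x) <= M * z,    # big-M link: z_i = 0 forces x_i = 0
               cp.sum(z) <= k]        # at most k nonzeros: a non-convex set, modeled through MICP
prob = cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b)), constraints)
prob.solve()                          # assumes a mixed-integer conic/QP solver is available
print(prob.status, np.round(x.value, 3))
```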

May 7: Shuyang Ling (NYU). Solving Inverse Problems on Networks: Graph Cuts, Optimization Landscape, Synchronization.

Information retrieval from graphs plays an increasingly important role in data science and machine learning. This talk focuses on two such examples. The first one concerns the graph cuts problem: how to find the optimal k-way graph cuts given an adjacency matrix. We present a convex relaxation of ratio cut and normalized cut, which gives rise to a rigorous theoretical analysis of graph cuts. We derive deterministic bounds for finding the optimal graph cuts via a spectral proximity condition that naturally depends on the intra-cluster and inter-cluster connectivity. Moreover, our theory provides theoretical guarantees for spectral clustering and community detection under the stochastic block model.
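
As a point of reference for the spectral clustering guarantee mentioned above, the sketch below (our own toy example, not the speakers' convex relaxation of ratio/normalized cut) runs plain spectral clustering with the normalized Laplacian on a two-block stochastic block model.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Toy stochastic block model: two communities, intra-edge prob 0.3, inter-edge prob 0.05.
rng = np.random.default_rng(0)
n, k = 100, 2
labels_true = np.repeat([0, 1], n // 2)
P = np.where(labels_true[:, None] == labels_true[None, :], 0.3, 0.05)
A = rng.random((n, n)) < P
A = np.triu(A, 1)
A = (A + A.T).astype(float)                      # symmetric adjacency, no self-loops

# Normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
deg = A.sum(axis=1)
Dinv = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
L = np.eye(n) - Dinv @ A @ Dinv

# Embed with the k eigenvectors of smallest eigenvalue, then cluster.
vals, vecs = np.linalg.eigh(L)
_, labels = kmeans2(vecs[:, :k], k, minit='++')
accuracy = max(np.mean(labels == labels_true), np.mean(labels == 1 - labels_true))
```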

The second example is about the landscape of a nonconvex cost function arising from group synchronization and matrix completion. This function also appears as the energy function of coupled oscillators on networks. We study how the landscape of this function is related to the underlying network topologies. We prove that the optimization landscape has no spurious local minima if the underlying network is a deterministic dense graph or an Erdős-Rényi random graph. The results find applications in signal processing and dynamical systems on networks.
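
For the synchronization side, a standard baseline (our own sketch; the talk's results concern the landscape of the nonconvex objective itself) is the spectral relaxation of phase synchronization: estimate angles from noisy relative measurements by taking the top eigenvector of the Hermitian data matrix and projecting its entries back to the unit circle.

```python
import numpy as np

# Hypothetical phase synchronization instance on a complete graph:
# recover z_i = exp(i*theta_i) from noisy measurements C_ij ~ z_i * conj(z_j).
rng = np.random.default_rng(1)
n, noise = 50, 0.3
theta = rng.uniform(0, 2 * np.pi, n)
z = np.exp(1j * theta)
C = np.outer(z, z.conj()) + noise * (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
C = (C + C.conj().T) / 2                         # Hermitian measurement matrix

# Spectral relaxation of max_{|z_i|=1} z^* C z: top eigenvector, entrywise normalized.
vals, vecs = np.linalg.eigh(C)
z_hat = vecs[:, -1]
z_hat = z_hat / np.abs(z_hat)

alignment = np.abs(np.vdot(z_hat, z)) / n        # close to 1 means good recovery (up to a global rotation)
```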

May 14: (Thurston Lecture)

May 21: John Carlsson (USC). The continuous approximation paradigm for logistical optimization problems.

In recent years, some of the most talked-about developments in the transportation sector include the use of drones, the introduction of last-mile delivery services, and the use of large-scale mapping data. Along with these new developments comes a host of new problems and trade-offs. We will discuss two such problems and use the "continuous approximation paradigm" to reveal basic insights about those factors that influence them most significantly.
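
One emblematic example of the paradigm (our choice of illustration, not necessarily an instance treated in the talk) is the Beardwood-Halton-Hammersley asymptotic: for n points drawn independently and uniformly from a compact planar region of area A, the length of the optimal traveling salesman tour satisfies

\[
\mathrm{TSP}_n \;\sim\; \beta \sqrt{nA} \qquad \text{as } n \to \infty
\]

for a universal constant \beta, so routing and fleet-sizing costs can be estimated from coarse density information rather than exact customer locations.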

May 28: Lin Xiao (Microsoft Research). Accelerated Bregman Proximal Gradient Method for Relatively Smooth Convex Optimization.

We consider the problem of minimizing the sum of two convex functions: one is differentiable and relatively smooth with respect to a reference convex function, and the other can be nondifferentiable but simple to optimize. The relative smoothness condition is much weaker than the standard assumption of uniform Lipschitz continuity of the gradients, and thus significantly increases the scope of potential applications. We present accelerated Bregman proximal gradient (ABPG) methods that employ the Bregman distance of the reference function as the proximity measure. The convergence rate of these methods is determined by a triangle scaling property of the Bregman distance. While we cannot yet prove an a priori guarantee, the adaptive variants of ABPG generate simple posterior numerical certificates for the actual attained convergence rates. We present numerical experiments with three applications: D-optimal experiment design, the Poisson linear inverse problem, and relative-entropy nonnegative regression. In all experiments, we obtain numerical certificates for the O(1/k^2) rate, matching the optimal rate for minimizing uniformly smooth functions. This is joint work with Filip Hanzely and Peter Richtarik.
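
For readers unfamiliar with the setup, here is the standard definition and the basic (unaccelerated) Bregman proximal gradient step, in our own notation; the ABPG methods of the talk add acceleration on top of this via the triangle scaling property. A differentiable convex function f is L-smooth relative to a reference convex function h if

\[
f(x) \;\le\; f(y) + \langle \nabla f(y), x - y \rangle + L\, D_h(x, y) \qquad \text{for all } x, y,
\]

where D_h(x, y) = h(x) - h(y) - \langle \nabla h(y), x - y \rangle is the Bregman distance generated by h. The corresponding proximal gradient step for minimizing f + \Psi is

\[
x_{k+1} \;=\; \arg\min_x \Big\{ \langle \nabla f(x_k), x \rangle + \Psi(x) + L\, D_h(x, x_k) \Big\},
\]

which recovers the usual proximal gradient method when h(x) = \|x\|^2 / 2.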

Jun 4: Michael Mahoney (UC Berkeley). Why Deep Learning Works: Traditional and Heavy-Tailed Implicit Self-Regularization in Deep Neural Networks.

Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production-quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a "size scale" separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered systems. This implicit self-regularization can depend strongly on the many knobs of the training process. In particular, by exploiting the generalization gap phenomenon, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size. This demonstrates that, all else being equal, DNN optimization with larger batch sizes leads to less well implicitly-regularized models, and it provides an explanation for the generalization gap phenomenon. Joint work with Charles Martin of Calculation Consulting, Inc.
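
To make the central diagnostic concrete, the sketch below (our own minimal example on a random matrix, not the production models studied in the talk) computes the empirical spectral density of a layer's correlation matrix. An untrained i.i.d. Gaussian layer produces a Marchenko-Pastur bulk; the talk's 5+1 phases track how the ESDs of trained layers drift away from this toward heavy-tailed spectra.

```python
import numpy as np

def empirical_spectral_density(W, bins=100):
    """Eigenvalues (and a histogram) of the correlation matrix X = W^T W / N for an N x M layer W."""
    N, M = W.shape
    eigs = np.linalg.eigvalsh(W.T @ W / N)
    hist, edges = np.histogram(eigs, bins=bins, density=True)
    return eigs, hist, edges

# Untrained i.i.d. Gaussian "layer": its ESD approximately follows the Marchenko-Pastur law,
# with support close to (1 +/- sqrt(M/N))^2.
W = np.random.randn(1000, 300)
eigs, hist, edges = empirical_spectral_density(W)
print(eigs.min(), eigs.max())
```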

Michael W. Mahoney is at the University of California at Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics, and he has worked and taught at Yale University in the mathematics department, at Yahoo Research, and at Stanford University in the mathematics department. Among other things, he is on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI), he was on the National Research Council's Committee on the Analysis of Massive Data, he co-organized the Simons Institute's fall 2013 and 2018 programs on the foundations of data science, and he runs the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets. He is currently the Director of the NSF/TRIPODS-funded FODA (Foundations of Data Analysis) Institute at UC Berkeley. More information is available at https://www.stat.berkeley.edu/~mmahoney/.

Jun 11: Alexander Cloninger (UCSD). Low Frequency Kernel Eigenfunctions in Statistical Distances, Function Sampling, and Data Generation.

We will discuss several topics related to the importance of using low frequency eigenfunctions of the Gaussian kernel. First, we will discuss the relevance of selecting important eigenfunctions for two-sample testing and statistical distances, namely kernel Maximum Mean Discrepancy. This creates a more powerful test than the classical MMD while still maintaining sensitivity to common departures. Second, we derive an algorithm for sub-sampling smooth functions (i.e., functions that are linear combinations of the low frequency eigenfunctions) that provably upper bounds the error in estimating the mean of the function, and use this to bound the error from subsampled MMD. Finally, we will discuss an efficient method for sampling new points from the space spanned by the low frequency eigenfunctions, and use a deep learning framework similar to Variational Autoencoders to map these new points back to the original data space in a way that respects the original data geometry.
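
As background for the first topic, here is a minimal sketch (our own notation and toy data) of the unbiased estimator of the squared kernel Maximum Mean Discrepancy with a Gaussian kernel; the talk's contribution is to reweight toward the low frequency kernel eigenfunctions rather than use this vanilla statistic directly.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of the squared kernel Maximum Mean Discrepancy between two samples."""
    Kxx, Kyy, Kxy = (gaussian_kernel(A, B, sigma) for A, B in [(X, X), (Y, Y), (X, Y)])
    m, n = len(X), len(Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * Kxy.mean())

# Two toy samples whose means differ: the statistic should be clearly positive.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
Y = rng.standard_normal((200, 2)) + 0.5
print(mmd2_unbiased(X, Y, sigma=1.0))
```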