Animashree AnandkumarTitle: Recent Advances and Challenges in Non-Convex Optimization (Click here for the talk slides)Abstract: Optimization lies at the heart of machine learning. The difficulty in solving many machine learning tasks stems directly from the non-convexity of the corresponding optimization problem. Non-convex optimization suffers from numerous critical points: local optima and saddle points. Due to "curse of dimensionality", the number of critical points can grow exponentially in the input dimension. Local search methods get stuck in bad local optima. In addition, saddle points can slow down convergence. How do we overcome the hardness in non-convex problems?Matrix and tensor decompositions form a special sub-class of non-convex optimization, where we can prove convergence to globally optimal solution under mild conditions. These results have led to some recent breakthroughs in guaranteed learning of probabilistic models and neural networks. More generally, a number of strategies have been proposed to deal with non-convexity: random restarts, problem dependent initialization, trust region methods, noisy stochastic gradient descent, annealing, and continuation techniques. Some of these methods will be covered in detail by the invited speakers in the workshop. Finally, I will present some open problems in the area. Yann LeCunTitle: Large-Scale Optimization for Deep Learning (Click here for the talk slides)Abstract:Greg ValiantAbstract: Suppose one wishes to estimate a structural property or optimize a function defined in terms of an object or distribution, given only data drawn from the object in question according to some stochastic process. How is the optimization or estimation task complicated by this "noisy" access to the underlying object? We focus on one specific instantiation of this general question: consider the problem of recovering a low-rank approximation to a matrix B of size M × M, which represents a probability distribution over M^2 outcomes, given access to an observed matrix of “counts” generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information theoretically optimal reconstruction algorithms? When can accurate reconstruction be accomplished in the sparse data regime? This problem lies at the core of a number of questions that are currently being considered by different communities, including community detection in graphs, learning structured models such as topic models or hidden Markov models, and the efforts from the natural language processing community to compute “word embeddings”. Many aspects of this problem—both in terms of learning and property testing/estimation and on both the algorithmic and information theoretic sides—remain open. We will present some recent progress on this question, and discuss several intriguing open directions related to the more general goal of achieving computationally efficient and information theoretically near-optimal algorithms for extracting structure underlying data.This is based on joint work with Qingqing Huang, Sham Kakade, and Weihao Kong Sanjeev Arora
Abstract: The theoretical difficulties for nonconvex optimization arise from a very good reason: they are NP-hard. This intractability on worst-case inputs suggests that to make theoretical progress, we should identify special properties of real-life inputs, so that the worst-case complexity results no longer apply. The talk will survey a few successful attempts along these lines, in contexts ranging from nonnegative matrix factorization, sparse coding/dictionary learning, and unsupervised deep learning, where some provable algorithms were designed this way.Hossein MobahiAbstract: I introduce a theory of nonconvex optimization based on solving a sequence of subproblems with increasing complexity that converge to the main problem. The solution of each intermediate subproblem guides solving the next. Variants of this idea include deterministic annealing, mean field annealing, graduated nonconvexity, etc. The success of this approach in finding a good minimum critically depends on the choice of subproblems. It is not clear if the mentioned variants lead to optimal subproblem construction. Instead, the theory I present implies a specific construction of the subproblems that is optimal in a certain sense. The construction relies on the diffusion process and is backed up by results in partial differential equations and the notion of convex envelope. This construction can lend itself to computationally efficient algorithms. I conclude by applying this theory to a few applications such as image alignment and deep learning, show how they improve performance, and discuss intriguing observations. Related to the latter, this theory may justify why the foveal vision helps seeing, why SIFT descriptor in computer vision works so well and how it can be improved, and why gradual learning rate reduction and layerwise pretraining help deep learning. This is joint work with John W. Fisher. Andrew R. BarronAbstract: Previously, greedy algorithms have been shown to have good approximation and estimation properties for superpositions of a sigmoidal function or other ridge activation functions. In each step the parameters of a new sigmoid are fit to the residuals of the previous sigmoids. However, that remains a challenging task to produce guarantees for the parameter search for each sigmoid. Here we discuss developments of two algorithms for this task. One is an implementation of adaptive annealing, in which internal modifications of the sigmoid is made, including parameter squashing and variable replication that permit individual parameter effects to be small, allowing stability of the adaptive annealing solution, at the expense of increasing dimensionality. The other algorithm is a convergent nonlinear modification of the tensor methods of Anandkumar and her colleagues which allows optimization of the inner product of the residuals and an certain forms of sigmoids for data from Gaussian or other rotationally invariant distributions. This work is joint with Jason Klusowski.Chris ReTitle: Understanding Convergence with Fractional Hypertree Width and Martingale Methods (Click here for the talk slides)Abstract: It's an amazing time for machine learning applications. In problems from text extraction to imaging, we can build applications that meet and in someways exceed human volunteer quality. However, lurking underneath most of the popular frameworks (Gibbs sampling, Matrix Factorization, or Deep Learning) is something troubling: our current theoretical results are too weak to tell us that our algorithms converge nearly as fast as we rely on in applications. This talk describes two of our, still nascent, attacks on understanding this gap. One attack is based on combinatorial widths, a notion from logic, that we've recently used to analyze Gibbs sampling and belief propagation. The second is a set of martingale methods for convergence of some mildly non-convex problems.This talk will reference applications at http://deepdive.stanford.edu/, which is the work of a number of people. Kevin C. ChenTitle: Spectral Algorithms for Learning HMMs and Tree HMMs for Epigenetics Data (Click here for the talk slides)Abstract: We are currently working on developing and applying spectral learning algorithms to epigenetics data. Recently, international consortia such as ENCODE and Roadmap Epigenomics have released massive epigenetics data sets from hundreds of human cell types with the aim of interpreting Genome-wide Association Studies for many human diseases. To analyze this data, we have implemented and extensively tested spectral algorithms for HMMs in our Spectacle software and found that they have significantly improved run time and biological interpretability compared to the EM algorithm. This is particularly important when the underlying classes are highly imbalanced, a pervasive issue in biology. To model multiple cell types, we developed novel spectral algorithms for tree structured HMMs. One of the projection techniques can be applied to other spectral algorithms and may be of independent interest. |