Past Talks

May 2:  Hung-Hsu Chou (Technical University of Munich)

More is Less: Understanding Compressibility of Neural Networks via Implicit Bias and Neural Collapse  (Watch the Video)

Despite their recent successes in various tasks, most modern machine learning algorithms lack theoretical guarantees, which are crucial for further development towards sensitive tasks such as designing self-driving cars. One mysterious phenomenon is that, among the infinitely many possible ways to fit data, the algorithms always find the "good" ones, even when the definition of "good" is not specified by the designers. In this talk I will cover the empirical and theoretical study of the connection between the good solutions in neural networks and the sparse solutions in compressed sensing, with four questions in mind: What happens? When does it happen? Why does it happen? How can we improve it? The key concepts are implicit bias/regularization, Bregman divergence, neural collapse, and the neural tangent kernel.

April 25:  Laura Balzano (University of Michigan)

Efficient Low-Dimensional Compression for Deep Overparameterized Learning and Fine-Tuning (Watch the Video)

While overparameterization in machine learning models offers great benefits in terms of optimization and generalization, it also leads to increased computational requirements as model sizes grow. In this work, we show that by leveraging the inherent low-dimensional structures and compressible dynamics within the model parameters, we can reap the benefits of overparameterization without the computational burdens. In practice, we demonstrate the effectiveness of this approach for deep low-rank matrix completion as well as fine-tuning language models. Our approach is grounded in theoretical findings for deep overparameterized low-rank matrix recovery, where we show that the learning dynamics of each weight matrix are confined to an invariant low-dimensional subspace. Consequently, we can construct and train compact, highly compressed factorizations possessing the same benefits as their overparameterized counterparts. In the context of deep matrix completion, our technique substantially improves training efficiency while retaining the advantages of overparameterization. For language model fine-tuning, we introduce a method called “Deep LoRA”, which improves the existing low-rank adaptation (LoRA) technique, leading to reduced overfitting and a simplified hyperparameter setup, all while maintaining comparable efficiency. The effectiveness of Deep LoRA is validated through its performance on natural language understanding tasks, particularly when fine-tuning with a limited number of samples.
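
As background for the fine-tuning setting, here is a minimal numpy sketch of a plain LoRA-style low-rank adapter (not the authors' Deep LoRA; dimensions and names are illustrative): a frozen pretrained weight W0 is perturbed by a trainable rank-r product B @ A, so the update has rank at most r.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 4          # r << min(d_out, d_in)

W0 = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
B = np.zeros((d_out, r))                  # trainable, initialised to zero
A = rng.standard_normal((r, d_in))        # trainable

def adapted_forward(x):
    # Only B and A are updated during fine-tuning; W0 stays frozen.
    return W0 @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
# With B = 0 the adapter is a no-op: training starts at the pretrained model.
assert np.allclose(adapted_forward(x), W0 @ x)

# After any update, the effective weight change B @ A has rank at most r.
B += 0.01 * rng.standard_normal((d_out, r))
assert np.linalg.matrix_rank(B @ A) <= r
```

Storing and training B and A costs r * (d_out + d_in) parameters instead of d_out * d_in, which is the source of the efficiency.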

April 18:  Keaton Hamm (University of Texas at Arlington)

Tensor decompositions by mode subsampling (Watch the Video)

We will overview variants of CUR decompositions for tensors. These are low-rank tensor approximations in which the constituent tensors or factor matrices are subtensors of the original data tensor. We will discuss variants of Tucker decompositions and those based on t-products in this framework. Characterizations of exact decompositions are given, and approximation bounds are shown for data tensors contaminated with Gaussian noise via perturbation arguments. Experiments are shown for image compression, and, time permitting, we will discuss applications to robust PCA.
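
The matrix case conveys the idea behind CUR-type decompositions: sample columns C and rows R of the data and glue them with the pseudoinverse of their intersection. A hedged numpy sketch on a synthetic low-rank matrix (the tensor variants in the talk generalize this to subtensors):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 50, 40, 3
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-3 data

# Oversample rows/columns (2r each) to improve conditioning.
rows = rng.choice(m, size=2 * r, replace=False)
cols = rng.choice(n, size=2 * r, replace=False)

C = A[:, cols]                               # sampled columns
R = A[rows, :]                               # sampled rows
U = np.linalg.pinv(A[np.ix_(rows, cols)])    # pseudoinverse of the intersection

# Exact recovery holds when the intersection has the same rank as A.
assert np.allclose(C @ U @ R, A)
```

For noisy data the same construction is only approximate, which is where the perturbation bounds mentioned above come in.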

April 4:  Rocio P Diaz Martin (Vanderbilt University)

Applications of Optimal Transport Theory (Watch the Video)

Optimal Transport (OT) theory has emerged as a powerful tool across various fields, including machine learning and signal processing. We will start with an overview of OT theory, focusing on the Wasserstein distance for probability measures through Monge’s, Kantorovich’s, and dynamic formulations of the OT problem. We will emphasize why the Wasserstein distance is useful for discriminating distributions. After that, we will introduce an "embedding" or "transform" based on OT: the “Linear Optimal Transport Embedding” (LOT), which significantly reduces the computational cost of the Wasserstein distance in higher dimensions. We will show some applications of this tool to interpolation and parameter estimation problems. If time permits, we will explore the balanced-mass limitation in classical OT, delving into Partial Optimal Transport and Unbalanced Optimal Transport.
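
As a concrete instance of why one-dimensional transport is cheap: for equal-size empirical measures on the real line, the optimal (Monge) map is monotone, so the Wasserstein distance reduces to matching sorted samples. A small numpy sketch (illustrative, not the LOT embedding itself):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p between two equal-size empirical measures on the real line.

    In 1D the optimal map is monotone, so it simply matches
    sorted samples -- no optimisation is needed.
    """
    x, y = np.sort(x), np.sort(y)
    return (np.mean(np.abs(x - y) ** p)) ** (1.0 / p)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = rng.standard_normal(10_000) + 3.0   # same shape, shifted by 3

# For a pure translation, W_2 equals the size of the shift (here ~3),
# while pointwise intensity comparisons would see almost no overlap.
print(wasserstein_1d(x, y))
```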

March 28:   Luis Rademacher (University of California, Davis)

On the Nyström Approximation for Preconditioning in Kernel Machines (Watch the Video)

Kernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool to speed up the convergence of such iterative algorithms for training kernel models. However, computing and storing a spectral preconditioner can be expensive, leading to large computational and storage overheads and precluding the application of kernel methods to problems with large datasets.

A Nyström approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications. We analyze the trade-offs of using such an approximated preconditioner. Specifically, we show that a sample of logarithmic size (as a function of the size of the dataset) enables the Nyström-based approximated preconditioner to accelerate gradient descent nearly as well as the exact preconditioner, while also reducing the computational and storage overheads.
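
A hedged numpy sketch of the Nyström idea: approximate a PSD kernel matrix from a small set of landmark columns. Here the kernel is replaced by an exactly rank-r surrogate so the approximation is exact; for real kernels with decaying spectra the same construction is accurate rather than exact, and the talk's logarithmic sample-size result concerns the preconditioner built from it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, s = 500, 20, 40               # n points, kernel rank r, sample size s

# A rank-r PSD "kernel" matrix standing in for a kernel with fast spectral decay
Phi = rng.standard_normal((n, r))
K = Phi @ Phi.T

idx = rng.choice(n, size=s, replace=False)          # landmark points
K_ns = K[:, idx] @ np.linalg.pinv(K[np.ix_(idx, idx)]) @ K[idx, :]

# When the landmarks capture the range of K, the Nystrom approximation
# is exact; it needs only O(n s) storage instead of O(n^2).
assert np.allclose(K_ns, K)
```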

This is joint work with Amirhesam Abedsoltan, Mikhail Belkin and Parthe Pandit.

March 21:  Ron Levie (Technion)

Regularity of graph neural networks  (Watch the Video)

In recent years, graph neural networks (GNNs) have led to ground-breaking achievements in the applied sciences and industry. These achievements pose exciting theoretical challenges: can the success of GNNs be grounded in solid mathematical frameworks? 


A GNN is a function that takes graphs (with node features) and returns vectors in R^n. Since the input space of a GNN is non-Euclidean, i.e., graphs can be of any degree and any topology, less is known about GNNs than about standard neural networks. In this talk, we claim that past theories of GNNs were missing an important ingredient: meaningful notions of metric on the input space, namely, graph similarity measures that are defined for all graphs of any degree and that respect and describe the behavior of GNNs. We will show how to construct and analyze such graph metrics using graphon theory and Szemerédi's regularity lemma. These metrics reveal that GNNs behave regularly in a certain sense, which leads to generalization bounds, universal approximation theorems, expressivity results, and novel GNN designs.

March 14:  Kevin O'Neill (Yale)

Sketching the Heat Kernel, aka Randomly Embedding Data with Gaussian Processes (Watch the Video)

We introduce a novel method for embedding high-dimensional data in a low-dimensional Euclidean space. The method, motivated by theoretical results of Krishnan, Adler, and Taylor (2017), works by taking the coordinates of the embedding to be independent realizations of a Gaussian process on the data. By taking the covariance kernel of the Gaussian process to be the heat kernel, the straight-line distance between points in the embedding is shown to approximate the well-known diffusion distance (see Coifman and Lafon, 2006). Computing the embedding amounts to sketching the heat kernel matrix.
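
A minimal numpy sketch of the embedding principle, with a generic Gaussian kernel standing in for the heat kernel (names and parameters are illustrative): each coordinate is an independent Gaussian process draw, and squared embedding distances concentrate around the kernel-induced distance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 40, 5000                     # n data points, m embedding coordinates

# Stand-in for a heat-kernel matrix: any symmetric PSD kernel works here
t = np.linspace(0, 1, n)
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / 0.1)

# Each embedding coordinate is an independent draw of the Gaussian
# process with covariance K; stack m of them, scaled by 1/sqrt(m).
L = np.linalg.cholesky(K + 1e-8 * np.eye(n))
Z = (L @ rng.standard_normal((n, m))) / np.sqrt(m)   # n x m embedding

# Squared embedding distances concentrate around the kernel distance
# K_ii + K_jj - 2 K_ij (for the heat kernel, the diffusion distance).
i, j = 3, 30
emb_d2 = np.sum((Z[i] - Z[j]) ** 2)
ker_d2 = K[i, i] + K[j, j] - 2 * K[i, j]
print(emb_d2, ker_d2)
```

In practice one would not form the full kernel and its Cholesky factor; sketching the heat kernel matrix, as in the talk, is the point of the method.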


In this talk, we will cover the theoretical underpinnings of these random embeddings, with a focus on the ever-important Karhunen–Loève expansion for Gaussian processes. We will discuss experimental results which demonstrate an advantage over existing methods with regards to robustness to outliers, clustering post-embedding, and embedding of data with small-scale structure which should not be considered noise. This is joint work with Anna Gilbert.


March 7:  Qiang Sun (University of Toronto)

Understanding ensemble learning: Our first step  (Watch the Video)

Overparameterized models, aka interpolators, are unstable. For example, the minimum-norm least squares interpolator exhibits unbounded test errors when dealing with noisy data. In this talk, we study how ensembling stabilizes and thus improves the generalization performance, measured by the out-of-sample prediction risk, of an individual interpolator. We focus on bagged linear interpolators, as bagging is a popular randomization-based ensemble method that can be implemented in parallel. We introduce the multiplier-bootstrap-based bagged least squares estimator, which can be formulated as an average of sketched least squares estimators. The proposed multiplier bootstrap encompasses the classical bootstrap with replacement as a special case, along with a more intriguing variant which we call the Bernoulli bootstrap.


Focusing on the proportional regime where the sample size scales proportionally with the feature dimensionality, we investigate the out-of-sample prediction risks of the sketched and bagged least squares estimators in both underparameterized and overparameterized regimes. Our results reveal the statistical roles of sketching and bagging. In particular, sketching modifies the aspect ratio and shifts the interpolation threshold of the minimum-norm estimator. However, the risk of the sketched estimator remains unbounded around the interpolation threshold due to excessive variance. In stark contrast, bagging effectively mitigates this variance, leading to a bounded limiting out-of-sample prediction risk. To further understand this stability improvement, we establish that bagging acts as a form of implicit regularization, substantiated by the equivalence of the bagged estimator with its explicitly regularized counterpart. We also discuss several extensions.
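
A toy numpy sketch of the Bernoulli-bootstrap bagging described above (underparameterized here for simplicity; all names are illustrative): each bag reweights the data by i.i.d. Bernoulli(q)/q multipliers with mean one, solves a weighted least squares problem, and the bagged estimator averages the bags.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B, q = 200, 50, 30, 0.7       # n samples, d features, B bags, keep prob q

X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

def weighted_lstsq(w):
    # Minimum-norm solution of the multiplier-weighted least squares problem
    sw = np.sqrt(w)[:, None]
    return np.linalg.pinv(sw * X) @ (np.sqrt(w) * y)

# Bernoulli bootstrap: each multiplier is Bernoulli(q)/q (mean 1), so a
# bag is effectively a sketched least squares fit on a random subsample.
bags = [weighted_lstsq(rng.binomial(1, q, size=n) / q) for _ in range(B)]
beta_bagged = np.mean(bags, axis=0)

# Sanity check: with all-ones multipliers we recover ordinary least squares.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(weighted_lstsq(np.ones(n)), beta_ols)
print(np.linalg.norm(beta_bagged - beta_true))
```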

February 29:  Karin Schnass (Universität Innsbruck)

Dictionary Learning - Sparse Matrix Factorisation (Watch the Video)

In this talk we will describe the problem of dictionary learning as the problem of factorising a short fat data matrix into an approximately square matrix, called dictionary, and another short fat but sparse matrix, called the sparse coefficients or codes. 

We will present simple algorithms for finding such a factorisation that are based on the optimisation of an energy functional and alternating minimisation. Keeping different variables fixed leads to different algorithms, such as the three golden classics of dictionary learning: the Method of Optimal Directions (MOD), the Online Dictionary Learning algorithm (ODL) and the K-Singular Value Decomposition (KSVD).

Finally, assuming that the data matrix follows a sparse model based on a well-behaved generating dictionary, where each of the K dictionary elements may be used with different probability, we show the following: given enough training signals and starting with a well-behaved initialisation, that is either within distance at most 1/log(K) of the generating dictionary or has a special structure ensuring that each element of the initialisation dictionary corresponds to exactly one element of the generating dictionary, MOD and ODL converge with a geometric convergence rate to the generating dictionary.
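
As a small illustration of the MOD update named above (a sketch, not the speaker's algorithm): with the coefficients held fixed, the dictionary minimising the residual is a least squares solution, and in the noiseless case with the true sparse coefficients one update recovers the generating dictionary exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, N, s = 20, 40, 500, 3         # signal dim, dictionary size, signals, sparsity

# Generating dictionary with unit-norm columns (atoms)
D = rng.standard_normal((d, K))
D /= np.linalg.norm(D, axis=0)

# Sparse coefficients: s nonzero entries per signal
C = np.zeros((K, N))
for j in range(N):
    C[rng.choice(K, size=s, replace=False), j] = rng.standard_normal(s)

Y = D @ C                            # noiseless data matrix

# MOD dictionary update: with the coefficients fixed, the dictionary
# minimising ||Y - D C||_F is the least squares solution Y C^+.
D_mod = Y @ np.linalg.pinv(C)
assert np.allclose(D_mod, D)
```

In the actual alternating scheme the coefficients are themselves re-estimated by sparse coding at every step, which is where the convergence analysis becomes delicate.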


The presented results are joint work with Simon Ruetz.

February 22:  Sung Ha Kang  (Georgia Tech) 

Identifying Differential Equations with Numerical Methods: IDENT to WeakIDENT and More (Watch the Video)

We explore identifying the underlying differential equation from a single given set of noisy time-dependent data. We assume that the governing equation is a linear combination of linear and nonlinear differential terms, so that identification can be formulated as a linear system, with a feature matrix multiplied by a coefficient vector. This talk will cover a collection of our recent work on this topic: starting from numerical time evolution (IDENT), we consider robust identification methods using successively denoised differentiation and subspace pursuit, and using the weak form for ODE and PDE recovery. Using the weak form improves robustness to higher noise levels and higher-order derivatives in the underlying equation. We further consider group subspace pursuit for varying-coefficient cases and identification in the Fourier domain.
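
A minimal sketch of the linear-system formulation (illustrative; not the IDENT/WeakIDENT pipeline, which adds denoising, subspace pursuit, and weak forms): build a feature matrix of candidate terms and regress the estimated time derivative on it. Here the data come from the noiseless logistic ODE x' = x - x^2.

```python
import numpy as np

# Exact solution of the logistic ODE x' = x - x^2, so the true
# coefficients against the library [x, x^2, x^3] are [1, -1, 0].
t = np.arange(0.0, 4.0, 1e-3)
x = 1.0 / (1.0 + np.exp(-t))

# Approximate the time derivative by finite differences
dxdt = np.gradient(x, t)

# Feature matrix of candidate terms times an unknown coefficient vector
F = np.column_stack([x, x**2, x**3])
coef, *_ = np.linalg.lstsq(F, dxdt, rcond=None)
print(coef)   # close to [1, -1, 0]
```

With noisy data the naive derivative estimate degrades quickly, which is exactly what the denoised-differentiation and weak-form variants in the talk address.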

February 15:  Qiyu Sun (University of Central Florida)

Barron Space for Graph Convolution Neural Networks (Watch the Video)

Graph convolutional neural networks (GCNNs) operate on graph domains and have achieved superior performance on a wide range of tasks. In this talk, we introduce a Barron space of functions on a compact domain of graph signals and discuss its various properties, such as the reproducing kernel Banach space property and the universal approximation property. We will also discuss how well functions in the Barron space are approximated by outputs of GCNNs, and the learnability of functions in the Barron space from their random samples.
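
For readers unfamiliar with the basic operation, here is a minimal numpy sketch of a graph convolution layer in the common Kipf-Welling style (an illustrative stand-in, not the GCNN class analysed in the talk); the key structural property is permutation equivariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, f_in, f_out = 8, 5, 4

# Random undirected graph with self-loops, symmetrically normalised
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T) + np.eye(n)
Dm = np.diag(1 / np.sqrt(A.sum(1)))
A_hat = Dm @ A @ Dm

W = rng.standard_normal((f_in, f_out))
H = rng.standard_normal((n, f_in))          # node features (a graph signal)

def gcn_layer(A_hat, H):
    return np.maximum(A_hat @ H @ W, 0)     # aggregate, transform, ReLU

# Relabelling the nodes permutes the output in the same way
# (equivariance), which is why such layers define functions on
# graph signals rather than on raw matrices.
perm = rng.permutation(n)
P = np.eye(n)[perm]
out = gcn_layer(A_hat, H)
assert np.allclose(gcn_layer(P @ A_hat @ P.T, P @ H), P @ out)
```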

February 8:  Nadav Dym (Technion)

Efficient Invariant Embeddings for multisets and point clouds (Watch the Video)

In many machine learning tasks, the goal is to learn an unknown function which has some known group symmetries. Equivariant machine learning algorithms exploit this by devising architectures (=function spaces) which have these symmetries by construction. Examples include convolutional neural networks which respect translation symmetries, neural networks for graphs or sets which respect their permutation symmetries, or neural networks for 3D point sets which additionally respect Euclidean symmetries.


A common theoretical requirement of symmetry-based architectures is that they be able to separate any two objects which are not related by a group symmetry. We will review results showing that, under very general assumptions, such a symmetry-preserving separating mapping f exists, and that the embedding dimension can be taken to be just over twice the dimension of the data. We will then propose a general methodology for efficient computation of such an f using random invariants, based on a ‘finite witness theorem’. This methodology is a generalization of the algebraic-geometry argument used in the well-known proof of phase retrieval injectivity, and can now incorporate analytic as well as algebraic functions and constraints. In particular, we will show that standard permutation-invariant architectures for multisets are separating if and only if they use analytic rather than piecewise-linear activations.
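
A hedged numpy sketch of a sum-pooled, permutation-invariant multiset embedding with an analytic activation (tanh), in the spirit of the random-invariant constructions mentioned above; the separation guarantees require the specific choices in the papers, not this toy.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 3, 10, 64                 # point dim, multiset size, embedding dim

W = rng.standard_normal((m, d))     # random linear functionals
b = rng.standard_normal(m)

def embed(X):
    # Sum pooling makes the embedding invariant to the order of the
    # points; tanh is an analytic (not piecewise-linear) activation.
    return np.tanh(X @ W.T + b).sum(axis=0)

X = rng.standard_normal((n, d))
X_shuffled = X[rng.permutation(n)]
assert np.allclose(embed(X), embed(X_shuffled))   # permutation invariance
```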


Based on work with Steven J. Gortler, Tal Amir, Snir Hordan, Ilai Avni and Ravina Ravina,  and on the papers:


[1] Low dimensional Invariant Embeddings for Universal Geometric Learning, by Nadav Dym and Steven J. Gortler

[2] Neural Injective Functions for Multisets, Measures and Graphs via a Finite Witness Theorem, by T. Amir, S.J. Gortler, I. Avni, R. Ravina and N. Dym

[3] Complete Neural Networks for Euclidean Graphs, by Snir Hordan, Tal Amir, Steven J. Gortler and Nadav Dym


February 2:  Ming Yuan (Columbia University)

Spectral Learning for High Dimensional Tensors

Matrix perturbation bounds developed by Weyl, Davis and Kahan, Wedin, and others play a central role in many statistical and machine learning problems. I shall discuss some of the recent progress in developing similar bounds for higher-order tensors. I will highlight the intriguing differences from matrices and explore their implications in spectral learning problems.

This talk presents joint work with Arnab Auddy. More details can be found in
Auddy A, Yuan M. Perturbation bounds for (nearly) orthogonally decomposable tensors with statistical applications[J]. Information and Inference: A Journal of the IMA, 2023, 12(2): 1044-1072. 

January 18:  Xue-Cheng Tai (NORCE)

PottsMGNet: A Mathematical Explanation of Encoder-Decoder Based Neural Networks (Watch the Video)

For problems in image processing and many other fields, a large class of effective neural networks has encoder-decoder-based architectures. Although these networks have achieved impressive performance, mathematical explanations of their architectures are still underdeveloped. In this talk, we study the encoder-decoder-based network architecture from the algorithmic perspective and provide a mathematical explanation. We use the two-phase Potts model for image segmentation as an example for our explanations. We associate the segmentation problem with a control problem in the continuous setting. Then, the continuous control model is time discretized by an operator splitting scheme, the PottsMGNet, and space discretized by the multigrid method.

We show that the resulting discrete PottsMGNet is equivalent to an encoder-decoder-based network. With minor modifications, it is shown that a number of the popular encoder-decoder-based neural networks are just instances of the proposed PottsMGNet. By incorporating Soft-Threshold-Dynamics into the PottsMGNet as a regularizer, the PottsMGNet is shown to be robust to network parameters such as width and depth, and achieves remarkable performance on datasets with very large noise. In nearly all our experiments, the new network performs as well as or better than existing networks for image segmentation in terms of accuracy and Dice score. This talk is based on joint work with Prof. Hao Liu (HKBU) and Prof. Raymond Chan (HK CityU).

January 11:  John Harlim (Pennsylvania State University)

Identifying Missing Dynamics with Machine Learning Algorithms (Watch the Video)

In this talk, I will discuss a general closure framework to compensate for the model error arising from missing dynamical systems. The proposed framework reformulates the model error problem into a supervised learning task to estimate a very high-dimensional closure model, deduced from the Mori-Zwanzig representation of a projected dynamical system with a projection operator chosen based on Takens embedding theory. Besides theoretical convergence, this connection provides a systematic framework for closure modeling using available machine learning algorithms. I will demonstrate supporting numerical examples in predicting spatiotemporal chaotic systems, such as the Kuramoto-Sivashinsky equation, and the more challenging statistical closure problems arising from a lack of training data. Beyond the training data issues, the closure problem has several practical challenges in non-homogeneous statistical regimes. That is, we need to ensure that the closure model produces a positive-definite covariance matrix estimate yet overcomes the inherent instability of the stochastic fluctuation dynamics. If time permits, I will discuss error bounds and mathematical conditions that allow the estimated model (attained by a machine learning algorithm) to reproduce the underlying stationary statistics, such as one-point statistical moments and auto-correlation functions, in learning Ito diffusions. Based on the perturbation theory of ergodic Markov chains and linear response theory, we deduce a linear dependence of the errors of one-point and two-point invariant statistics on the error in the learning of the drift and diffusion coefficients. We examine the mathematical conditions for the theoretical error bounds on two well-understood learning algorithms: the kernel-based spectral regression method and shallow random neural networks with the ReLU activation function.

December 14: Gabriele Steidl (TU Berlin)

Generative Modeling via Maximum Mean Discrepancy Flows  (Watch the Video)    

We consider gradient flows with respect to the maximum mean discrepancy (MMD) with negative distance kernel, which is also known as the energy distance. In order to achieve computational efficiency, we prove that for certain kernels the MMD coincides with its sliced version. Therefore, all computations can be performed in a one-dimensional setting, where the MMD with negative distance kernel can be evaluated by a simple sorting algorithm with improved computational complexity. This enables us to simulate MMD particle flows in high dimensions for a large number of particles. We approximate these particle flows by neural networks and apply them to generative modeling and posterior sampling in Bayesian inverse problems.
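
The sorting trick for the one-dimensional negative distance kernel can be illustrated directly: the average pairwise distance of a sample, naively an O(n^2) double sum, collapses to a weighted sum over sorted values in O(n log n). A small numpy sketch (illustrative, not the paper's flow computation):

```python
import numpy as np

def mean_pairwise_dist(x):
    """E|X - X'| over an empirical sample, in O(n log n) via sorting.

    For sorted x (0-indexed), sum_{i<j} (x_j - x_i) = sum_k (2k - n + 1) x_k,
    since x_k is the larger element in k pairs and the smaller in n - 1 - k.
    """
    x = np.sort(x)
    n = len(x)
    total = np.dot(2 * np.arange(n) - n + 1, x)   # sum over unordered pairs
    return 2 * total / n**2                        # ordered pairs / n^2

rng = np.random.default_rng(0)
x = rng.standard_normal(300)

naive = np.abs(x[:, None] - x[None, :]).mean()     # O(n^2) reference
assert np.allclose(mean_pairwise_dist(x), naive)
```

The same sorted-sum structure is what makes each term of the sliced MMD cheap to evaluate for large particle systems.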

From a theoretical viewpoint, we study Wasserstein gradient flows with respect to our MMD functionals. Interestingly, particles might "explode" in this setting, i.e., the flow turns atomic measures into absolutely continuous ones and vice versa. We analytically derive the Wasserstein flows for some special cases and propose a numerical approximation of suitable forward and backward time discretizations by generative neural networks.


Joint work with J. Hertrich, P. Hagemann, F. Altekrüger, R. Beinert and M. Gräf (TU Berlin).   

December 7:   Laurent Demanet (MIT)

Recovery Phenomena with Symmetric Autoencoders (Watch the Video)    

Is it possible to guess what a scene in an image would look like if the picture was taken from a different angle? Would it sound like you if an AI generated a deepfake of your voice? Can we find the solution of a PDE we have never seen, if we collect enough solutions of nearby equations? These questions seem to fit in a common mathematical framework of estimation of low-dimensional latent processes under maps of controlled complexity. After reviewing known results in the context of generative priors, I will explain how to formulate recovery guarantees for symmetric autoencoders using tools from applied probability and approximation theory. Joint work with Borjan Geshkovski (MIT).  

November 30:  Elisa Negrini (UCLA)

Applications of No-Collision Transportation Maps in Manifold Learning (Watch the Video)

In this work, we investigate applications of no-collision transportation maps, introduced by Nurbekyan et al. in 2020, in manifold learning for image data. Recently, there has been a surge in applying transportation-based distances and features to data representing motion-like or deformation-like phenomena. Indeed, comparing intensities at fixed locations often does not reveal the data structure. No-collision maps and distances developed in [Nurbekyan et al., 2020] are sensitive to geometric features similar to optimal transportation (OT) maps but are much cheaper to compute due to the absence of optimization. In this work, we prove that no-collision distances provide an isometry between translations (respectively dilations) of a single probability measure and the translation (respectively dilation) vectors equipped with a Euclidean distance. Furthermore, we prove that no-collision transportation maps, as well as OT and linearized OT maps, do not in general provide an isometry for rotations. The numerical experiments confirm our theoretical findings and show that no-collision distances achieve similar or better performance on several manifold learning tasks compared to other OT- and Euclidean-based methods at a fraction of the computational cost.

November 23:  Yi Hui (National University of Singapore)

Self-supervised Deep Learning for Inverse Imaging Problems (Watch the Video)

Deep learning has become a prominent tool in a variety of sectors, including inverse imaging problems. The scarcity of ground-truth images in certain fields, such as medicine and science, has limited the applicability of supervised learning strategies in these data-constrained environments. In this talk, I will discuss several self-supervised learning techniques that enable teaching a deep neural network (DNN) to reconstruct images using merely their noisy and incomplete measurements. The proposed approaches integrate Bayesian inference, including the MMSE estimator and the Expectation-Maximization algorithm, with the over-parametrization of images by untrained DNNs, where Monte Carlo sampling provides the computational foundation. Extensive experiments showed that these self-supervised learning techniques exhibit competitive performance in comparison to supervised learning across various real-world imaging applications.

November 16:  Amit Singer (Princeton University)

Method of Moments in Cryo-EM (Watch the Video) 

Single-Particle Electron Cryomicroscopy (cryo-EM) will soon become the leading technique for determining 3-D molecular structures at high resolution. In cryo-EM, the 3-D structure needs to be determined from many 2-D noisy tomographic projection images of the molecule taken at unknown viewing directions and positions. The maximum likelihood approach has been very successful in determining large molecular structures, but struggles with small molecules, for which the signal-to-noise ratio (SNR) of the images is too low for accurate viewing direction estimation and detection. Motivated by the challenge of reconstructing small molecules by cryo-EM, we have been investigating the method of moments as an alternative statistical and computational framework. In particular, the method of moments provides theoretical insight into the sample complexity, that is, the number of images required for reconstruction. No prior knowledge of cryo-EM or the method of moments is necessary for this talk.

September 21, 2023:  Anders Hansen (University of Cambridge)

On Smale's 9th problem, generalised hardness of approximation and the limits of AI (Watch the Video)

Alchemists wanted to create gold, Hilbert wanted an algorithm to solve Diophantine equations, researchers want to make deep learning in AI robust without hallucinations, MATLAB wants (but fails) to detect when it provides wrong solutions to linear programs, etc. Why do we fail in so many of these fundamental cases? The reason is typically methodological barriers. The history of science is full of methodological barriers -- reasons why we never succeed in reaching certain goals. In many cases, this is due to barriers in the foundations of mathematics. This talk introduces new such barriers from the foundations: the phenomenon of generalised hardness of approximation (GHA). GHA grows out of our work on Smale’s 9th problem on the list of mathematical problems for the 21st century. This phenomenon is not a rare issue -- it happens on a daily basis in computations (in particular in AI) -- and causes modern software such as MATLAB to fail on basic problems, and even to certify nonsensical solutions as correct. GHA is close in spirit to hardness of approximation (HA) in computer science. Assuming P is not equal to NP, HA is the phenomenon that one can easily compute an epsilon-approximation to the solution of a discrete computational problem for epsilon greater than some epsilon_0 greater than 0, but for epsilon less than epsilon_0 it suddenly becomes intractable. HA was discovered decades ago and has been transformative, being the subject of several Gödel, Nevanlinna and ACM prizes. GHA is a similar but distinct mathematical phenomenon that requires a new set of mathematical tools from computational mathematics rather than computer science (NB: GHA is independent of P vs. NP). GHA grows out of the Solvability Complexity Index (SCI) hierarchy framework, and has so far been detected in optimisation, inverse problems, deep learning and AI, as well as computer-assisted proofs. We will discuss in particular the connections to AI.

November 9:  Zhao Ren (The University of Pittsburgh)

Heteroskedastic Sparse PCA in High Dimensions (Watch the Video)

Principal component analysis (PCA) is one of the most commonly used techniques for dimension reduction and feature extraction. Though high-dimensional sparse PCA has been well studied, little is known when the noise is heteroskedastic, which turns out to be ubiquitous in many scenarios, such as biological sequencing data and information network data. We propose an iterative algorithm for sparse PCA in the presence of heteroskedastic noise, which alternately updates the estimates of the sparse eigenvectors using the power method with adaptive thresholding in one step, and imputes the diagonal values of the sample covariance matrix to reduce the estimation bias due to heteroskedasticity in the other step. Our procedure is computationally fast and provably optimal under the generalized spiked covariance model, assuming the leading eigenvectors are sparse. A comprehensive simulation study demonstrates its robustness and effectiveness in various settings.
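
A toy numpy version of the power step with thresholding (the diagonal-imputation step and the adaptive threshold of the actual algorithm are simplified away; all parameters are illustrative): on an exact spiked covariance with heteroskedastic diagonal noise, it recovers the sparse leading eigenvector.

```python
import numpy as np

n, k = 50, 5
v = np.zeros(n)
v[:k] = 1 / np.sqrt(k)                       # sparse leading eigenvector

# Spiked covariance with heteroskedastic diagonal noise
noise = np.diag(np.linspace(0.5, 2.0, n))
Sigma = 6.0 * np.outer(v, v) + noise

def threshold_power(Sigma, tau=0.05, iters=100):
    u = np.ones(n) / np.sqrt(n)
    for _ in range(iters):
        u = Sigma @ u
        u[np.abs(u) < tau * np.abs(u).max()] = 0.0   # hard threshold
        u /= np.linalg.norm(u)
    return u

u = threshold_power(Sigma)
# The thresholded power method recovers the sparse support and direction.
assert set(np.nonzero(u)[0]) == set(range(k))
assert abs(u @ v) > 0.99
```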

November 2:  Kathlén Kohn (KTH Stockholm)

Understanding Linear Convolutional Neural Networks via Sparse Factorizations of Real Polynomials (Watch the Video)

This talk will explain that convolutional neural networks without activation functions parametrize polynomials that admit a certain sparse factorization. For a fixed network architecture, these polynomials form a semialgebraic set. We will investigate how the geometry of this semialgebraic set (e.g., its singularities and relative boundary) changes with the network architecture. Moreover, we will explore how these geometric properties affect the optimization of a loss function for given training data. We prove that for architectures where all strides are larger than one and for generic data, the non-zero critical points of the squared-error loss are smooth interior points of the semialgebraic function space. This property is known to be false for dense linear networks and for linear convolutional networks with stride one.

This talk is based on joint work with Guido Montúfar, Vahid Shahverdi, and Matthew Trager.

October 26:  Michael Murray (UCLA)

Training shallow ReLU networks on noisy data using hinge loss: when do we overfit and is it benign? (Watch the Video)

In this talk I’ll discuss recent work studying benign overfitting in two-layer ReLU networks trained using gradient descent and hinge loss on noisy data for binary classification. Unlike for logistic or exponentially tailed losses, the implicit bias in this setting is poorly understood, and therefore our results and techniques are distinct from other recent and concurrent works on this topic. In particular, we consider linearly separable data for which a relatively small proportion of labels are corrupted, and identify conditions on the margin of the clean data which give rise to three distinct training outcomes: benign overfitting, in which zero loss is achieved and with high probability test data is classified correctly; overfitting, in which zero loss is achieved but test data is misclassified with probability lower bounded by a constant; and non-overfitting, in which clean points, but not corrupt points, achieve zero loss, and again with high probability test data is classified correctly. Our analysis provides a fine-grained description of the dynamics of neurons throughout training and reveals two distinct phases: in the first phase clean points achieve close to zero loss, while in the second phase clean points oscillate on the boundary of zero loss while corrupt points either converge towards zero loss or are eventually zeroed by the network. We prove these results using a combinatorial approach that involves bounding the number of clean versus corrupt updates across these phases of training.

October 19:  Yifei Lou (University of North Carolina at Chapel Hill)

Scale-invariant regularizations for sparse signal and low-rank tensor recovery (Watch the Video)

Regularization plays a pivotal role in tackling challenging ill-posed problems by guiding solutions towards desired properties. In this presentation, I will introduce the ratio of the L1 and L2 norms, denoted as L1/L2, which serves as a scale-invariant and parameter-free regularization method for approximating the elusive L0 norm. Our theoretical analysis reveals a strong null space property (sNSP) and proves that any sparse vector qualifies as a local minimizer of the L1/L2 model when a system matrix adheres to the sNSP condition. Furthermore, we extend the L1/L2 model to the realm of low-rank tensor recovery by introducing a tensor nuclear norm over the Frobenius norm (TNF). We demonstrate that local optimality can be assured under an NSP-type condition. Given that both the L1 and L2 norms are absolutely one-homogeneous functions, we propose a gradient descent flow method to minimize the quotient model similarly to the Rayleigh quotient minimization problem. This derivation offers valuable numerical insights into convergence analysis and the boundedness of solutions. Throughout the presentation, we will explore a range of applications, including limited angle CT reconstruction and video background modeling, showcasing the superior performance of our approach compared to state-of-the-art methods.
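
The scale invariance and sparsity promotion of the L1/L2 ratio are easy to see numerically; a small sketch:

```python
import numpy as np

def l1_over_l2(x):
    return np.linalg.norm(x, 1) / np.linalg.norm(x, 2)

sparse = np.zeros(100)
sparse[:3] = np.array([5.0, -2.0, 1.0])
dense = np.ones(100)

# Scale invariance: rescaling the signal does not change the penalty.
assert np.isclose(l1_over_l2(sparse), l1_over_l2(7.3 * sparse))

# The ratio ranges from 1 (1-sparse) to sqrt(n) (maximally spread),
# so minimising it promotes sparsity, mimicking the L0 norm without
# a tuning parameter.
assert l1_over_l2(sparse) < l1_over_l2(dense)
assert np.isclose(l1_over_l2(dense), np.sqrt(100))
```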

October 5:  Christopher Musco (New York University)

Robust Active Learning via Leverage Score Sampling (Watch the Video)

Active learning is a promising approach to fitting machine learning models in "data starved" applications, where the cost of collecting labels is the primary cost of model training. In many of these applications, including in computational science and ML guided engineering, we need active learning methods that work in the challenging agnostic or "adversarial noise" setting. In this setting, collected labels might not match the model being trained, even in expectation. Nevertheless, we seek methods that are robust enough to find the best possible fit with as little data as possible. In this talk, I will discuss recent developments on a flexible class of active learning algorithms based on leverage score sampling, a technique originally used to develop randomized algorithms for large scale linear algebra. I will show how leverage score based methods can provably address the challenging agnostic learning problem in a variety of settings, including for linear models, kernel regression models, and also simple neural networks with non-linearities. 
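
As a rough illustration of the idea (our toy construction, not the speaker's code): leverage scores of a design matrix can be read off an orthonormal basis of its column space, and sampling labels proportionally to them yields a reweighted least-squares fit from far fewer labels than data points.

```python
import numpy as np

rng = np.random.default_rng(0)

def leverage_scores(A):
    """Row leverage scores of A: squared row norms of an orthonormal
    basis Q for the column space (they sum to rank(A))."""
    Q, _ = np.linalg.qr(A)
    return (Q ** 2).sum(axis=1)

# Active-learning-style sketch: "label" only m of n points, sampled with
# probability proportional to leverage, then solve a reweighted
# least-squares problem on the subsample.
n, d, m = 2000, 10, 200
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y = A @ x_true + 0.1 * rng.standard_normal(n)   # noisy labels

tau = leverage_scores(A)
p = tau / tau.sum()
idx = rng.choice(n, size=m, replace=True, p=p)
w = 1.0 / np.sqrt(m * p[idx])                    # importance weights
x_hat = np.linalg.lstsq(w[:, None] * A[idx], w * y[idx], rcond=None)[0]
```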

Based on joint work with Xiaoou Cheng, Aarshvi Gajjar, Tamás Erdélyi, Chinmay Hegde, Raphael Meyer, Cameron Musco, Atsushi Shimzu, Jonathan Weare, David Woodruff, Taisuke Yasuda, and Samson Zhou.

September 28, 2023: Misha Kilmer (Tufts University)

Tensor BM-Decomposition for Compression and Analysis of Spatio-Temporal Video Data (Watch the Video)

Given tensors A, B, C of size m x 1 x n, m x p x 1, and 1 x p x n, respectively, their Bhattacharya-Mesner (BM) product is a third-order tensor of dimension m x p x n and BM-rank 1 (Mesner and Bhattacharya, 1990). Thus, if a third-order tensor can be written as a sum of a small number of such BM-rank-1 terms, this BM-decomposition (BMD) offers an implicitly compressed representation of the tensor. First, we give a generative model which illustrates that spatio-temporal video data can be expected to have low BM-rank. We present and study properties of an iterative algorithm for computing an approximate BMD, including convergence behavior and a starting guess that allows for the decomposition of our spatio-temporal data into stationary and non-stationary components. Several numerical experiments show the power of our BMD approach, particularly as compared to the state-of-the-art dynamic mode decomposition method, in extracting important temporal information from video data while simultaneously compressing the data.
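
Writing out the entrywise rule for a single BM-rank-1 term may help fix ideas. This sketch follows our reading of the dimensions in the abstract (T[i,j,k] = A[i,0,k] B[i,j,0] C[0,j,k]), so treat it as illustrative:

```python
import numpy as np

def bm_rank1(a, b, c):
    """BM product of A (m x 1 x n), B (m x p x 1), C (1 x p x n):
    T[i, j, k] = A[i, 0, k] * B[i, j, 0] * C[0, j, k]  (BM-rank 1)."""
    return np.einsum('ik,ij,jk->ijk', a[:, 0, :], b[:, :, 0], c[0, :, :])

def bmd(terms):
    """A BM-decomposition: a sum of BM-rank-1 terms."""
    return sum(bm_rank1(a, b, c) for a, b, c in terms)

rng = np.random.default_rng(0)
m, p, n = 4, 5, 6
a = rng.standard_normal((m, 1, n))
b = rng.standard_normal((m, p, 1))
c = rng.standard_normal((1, p, n))
T = bm_rank1(a, b, c)              # an m x p x n tensor of BM-rank 1
```

Storing the three factors takes mn + mp + pn numbers versus mpn for the full tensor, which is the implicit compression the abstract refers to.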

September 21, 2023: Anders Hansen (University of Cambridge)

On Smale's 9th problem, generalised hardness of approximation and the limits of AI (Watch the Video)

Alchemists wanted to create gold, Hilbert wanted an algorithm to solve Diophantine equations, researchers want to make deep learning robust in AI without hallucinations, MATLAB wants (but fails) to detect when it provides wrong solutions to linear programs, etc. Why do we fail in so many of these fundamental cases? The reason is typically methodological barriers. The history of science is full of methodological barriers — reasons why we never succeed in reaching certain goals. In many cases, this is due to barriers in the foundations of mathematics. This talk introduces new such barriers from foundations: the phenomenon of generalised hardness of approximation (GHA). GHA grows out of our work on Smale’s 9th problem on the list of mathematical problems for the 21st century. This phenomenon is not a rare issue — it happens on a daily basis in computations (in particular in AI) — and causes modern software such as MATLAB to fail on basic problems, and even to certify nonsensical solutions as correct.

GHA is close in spirit to hardness of approximation (HA) in computer science. Assuming P not equal to NP, HA is the phenomenon that one can easily compute an epsilon-approximation to the solution of a discrete computational problem for epsilon > epsilon_0 > 0, but for epsilon < epsilon_0 it suddenly becomes intractable. HA was discovered decades ago and has been transformative, being the subject of several Gödel, Nevanlinna and ACM prizes. GHA is a similar but distinct mathematical phenomenon that requires a new set of mathematical tools from computation rather than computer science (NB: GHA is independent of P vs. NP). GHA grows out of the Solvability Complexity Index (SCI) hierarchy framework, and has so far been detected in optimisation, inverse problems, deep learning and AI, as well as computer-assisted proofs. We will discuss in particular the connections to AI.

September 14, 2023:  March Tian Boedihardjo (Michigan State University)

Covariance Loss and Privacy (Slides)

I will formulate a problem concerning covariance loss, present a solution, and give an application to differential privacy.  Joint work with Thomas Strohmer and Roman Vershynin.

September 7, 2023:  Jamie Haddock (Harvey Mudd College)

Randomized Kaczmarz Methods: Corruption, Consensus, and Concentration (Watch the Video )

The Kaczmarz methods are a family of simple, deterministic or randomized, iterative methods which can be employed for solving consistent systems of linear equations of the form Ax = b, or related problems.  These methods have gained popularity in recent times due to their amenability to large-scale data and distributed computing environments.  This talk will focus on results in three areas, all related or applicable to the Kaczmarz methods: iterative methods for adversarially corrupted systems of linear equations; analyzing the dynamics of simple models of consensus amongst interacting agents; and proving bounds on the concentration and variance of randomized iterative methods.  This talk presents joint works with Deanna Needell, Elizaveta Rebrova, William Swartworth, Benjamin Jarman, Chen Yap, Toby Anderson, and Max Collins. 
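
For readers new to the method, the basic randomized Kaczmarz iteration (Strohmer–Vershynin row sampling; a minimal sketch, not the corruption-robust variants discussed in the talk) looks like this:

```python
import numpy as np

def randomized_kaczmarz(A, b, iters=2000, seed=0):
    """Randomized Kaczmarz for consistent Ax = b: at each step pick row i
    with probability ||a_i||^2 / ||A||_F^2 and project the iterate onto
    the hyperplane a_i^T x = b_i."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_norms = (A ** 2).sum(axis=1)
    p = row_norms / row_norms.sum()
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.choice(m, p=p)
        x += (b[i] - A[i] @ x) / row_norms[i] * A[i]
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 10))
x_true = rng.standard_normal(10)
x = randomized_kaczmarz(A, A @ x_true)
```

Each update touches only a single row of A, which is what makes the family attractive for large-scale and distributed settings.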

August 17, 2023:  Felix Krahmer (Technical University of Munich)

Robust low-rank completion with adversarial noise (Watch the Video)

The problem of recovering a high-dimensional low-rank matrix from a limited set of random measurements has enjoyed various applications and gained a detailed theoretical foundation over the last 15 years. An instance of particular interest is the matrix completion problem, where the measurements are entry observations. The first rigorous recovery guarantees for this problem were derived for the nuclear norm minimization approach, a convex proxy for the NP-hard problem of constrained rank minimization. For matrices whose entries are “spread out” well enough, this convex problem admits a unique solution which corresponds to the ground truth. In the presence of random measurement noise, the reconstruction performance is also well-studied, but the performance under adversarial noise remains less understood. While some error bounds have been derived for both convex and nonconvex approaches, these bounds exhibit a gap to information-theoretic lower bounds and to the provable performance for Gaussian measurements. Indeed, a recent analysis of the problem suggests that under small-scale adversarial noise, the reconstruction error can be significantly amplified. In this talk, we investigate this amplification quantitatively and provide new reconstruction bounds for both small and large noise levels that suggest a quadratic dependence between the reconstruction error and the noise level.
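
As background for the nuclear-norm approach, here is a minimal soft-impute-style iteration (singular value thresholding on the filled-in matrix; the threshold and sizes are our illustrative choices, and the talk's adversarial-noise analysis is not reproduced here):

```python
import numpy as np

def svt(X, tau):
    """Singular value soft-thresholding: the proximal map of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def complete(M_obs, mask, tau=0.5, iters=500):
    """Soft-impute-style iteration for noiseless matrix completion:
    fill the unobserved entries with the current low-rank estimate,
    then shrink the singular values (proximal gradient with step 1)."""
    X = np.zeros_like(M_obs)
    for _ in range(iters):
        X = svt(mask * M_obs + (1 - mask) * X, tau)
    return X

rng = np.random.default_rng(0)
n, r = 30, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-2 truth
mask = (rng.random((n, n)) < 0.6).astype(float)                # 60% observed
X = complete(mask * M, mask)
rel = np.linalg.norm(X - M) / np.linalg.norm(M)
```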

Joint work with Julia Kostin (ETH Zurich, Switzerland) and Dominik Stöger (KU Eichstätt-Ingolstadt, Germany).

Third Year of Talks Below:  September 2022 - May 2023

Organizers (July 2022 - June 2023):  Axel Flinth (Principal Organizer for Europe/Asia, Umeå University), Longxiu Huang (Principal Organizer for The Americas, Michigan State University), Alex Cloninger (UC San Diego), Jamie Haddock (Harvey Mudd College), Mark Iwen (Michigan State University), Felix Krahmer (Technische Universität München), Weilin Li (City College of New York), Karin Schnass (University of Innsbruck), and Yuying Xie (Michigan State University).

May 25: Kui Ren (Columbia University) 

On Weighted Least-Squares for Computational Optimization (Watch the Video) 

 Weighted least-squares optimization has been revisited in recent years for the computational solution of data-matching problems. Different weighting strategies have been proposed depending on the features in the solutions one wants to promote or suppress. We will review some of the recent results and applications in this direction and highlight the impact of the weighting schemes on problems with noisy data.
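
The basic mechanics are standard: a weight w_i on the i-th residual is equivalent to rescaling that row by sqrt(w_i). A small sketch with inverse-variance weights (our toy setup, not from the talk) shows the effect on unevenly noisy data:

```python
import numpy as np

def weighted_lstsq(A, b, w):
    """Minimize sum_i w_i (a_i^T x - b_i)^2 by rescaling rows:
    ordinary least squares on (sqrt(W) A, sqrt(W) b)."""
    sw = np.sqrt(w)
    return np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)[0]

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
x_true = rng.standard_normal(5)
noise = rng.standard_normal(200)
noise[:100] *= 10.0                      # first half of the data is much noisier
b = A @ x_true + 0.1 * noise

w = np.ones(200)
w[:100] = 0.01                           # down-weight the noisy half (inverse variance)
x_weighted = weighted_lstsq(A, b, w)
x_unweighted = np.linalg.lstsq(A, b, rcond=None)[0]
```

Inverse-variance weighting suppresses the influence of the noisy rows, so the weighted fit should land markedly closer to the ground truth here.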

May 24: Joel Tropp (California Institute of Technology)

Randomly Pivoted Cholesky: Addressing the challenges of large-scale kernel computations (Watch the Video)

 Kernel methods are used for prediction and clustering in many data science and scientific computing applications, but applying kernel methods to a large number of data points N is expensive due to the high cost of manipulating the N x N kernel matrix. A basic approach for speeding up kernel computations is low-rank approximation, in which we replace the kernel matrix A with a factorized approximation that can be stored and manipulated more cheaply. When the kernel matrix A has rapidly decaying eigenvalues, mathematical existence proofs guarantee that A can be accurately approximated using a constant number of columns (without ever looking at the full matrix). Nevertheless, for a long time, designing a practical and provably justified algorithm to select the appropriate columns proved challenging.


This talk introduces Randomly Pivoted Cholesky (RPC), a natural algorithm for approximating an N x N positive-semidefinite matrix using k adaptively sampled columns. RPC can be implemented with just a few lines of code; it requires only (k+1)N entry evaluations and O(k^2 N) additional arithmetic operations. In experiments, RPC matches or improves on the performance of alternative algorithms for low-rank psd approximation. Moreover, RPC provably achieves near-optimal approximation guarantees. The simplicity, effectiveness, and robustness of this algorithm strongly support its use for large-scale kernel computations. This work offers an accessible example of the power of using randomized algorithms for linear algebra computations.


Joint work with Yifan Chen, Ethan Epperly, and Rob Webber. Available at arXiv:2207.06503.

May 18: Jing Lei (Carnegie Mellon University)

On High-Dimensional Gaussian Comparisons For Cross-Validation (Watch the Video)

 We derive high-dimensional Gaussian comparison results for the standard V-fold cross-validated risk estimates.  Our result combines a recent stability-based argument for the low-dimensional central limit theorem of cross-validation with the  high-dimensional Gaussian comparison framework for sums of independent random variables.  These results give new insights into the joint sampling distribution of cross-validated risks in the context of model comparison and tuning parameter selection, where the number of candidate models and tuning parameters can be larger than the fitting sample size. As a consequence, our results provide theoretical support for a recent methodological development that constructs model confidence sets using cross-validation.

May 11: Lenka Zdeborova (EPFL) 

Insights on learning with neural networks from exactly solvable high-dimensional models (Watch the Video)

Statistical physics has studied exactly solvable models of neural networks for more than four decades. In this talk, we will put this line of work in perspective with recent empirical observations stemming from deep learning. We will describe several types of phase transition that appear in the limit of large sizes as a function of the amount of data. Discontinuous phase transitions are linked to adjacent algorithmic hardness. This so-called hard phase influences the behaviour of gradient-descent-like algorithms. We show a case where the hardness is mitigated by overparametrization, suggesting that the benefits of overparametrization may be linked to the use of a certain type of algorithm. We then discuss the overconfidence of overparametrized neural networks and evaluate methods to mitigate it and calibrate the uncertainty.

May 4: Petros Drineas (Purdue University)  

Randomized Linear Algebra for Interior Point Methods (Watch the Video)

Linear programming is a central problem in computer science and applied mathematics with numerous applications across a wide range of domains, including machine learning and data science.  Interior point methods (IPMs) are a common approach to solving linear programs with strong theoretical guarantees and solid empirical performance. The time complexity of these methods is dominated by the cost of solving a linear system of equations at each iteration. In common applications of linear programming, particularly in data science and scientific computing, the size of this linear system can become prohibitively large, requiring the use of iterative solvers which provide an approximate solution to the linear system. Approximately solving the linear system at each iteration of an IPM invalidates common analyses of IPMs and the theoretical guarantees they provide.  In this talk we will discuss how randomized linear algebra can be used to design and analyze theoretically and practically efficient IPMs when using approximate linear solvers. 

April 27: Anru Zhang (Duke University)

Tensor Learning in 2020s: Methodology, Theory, and Applications (Watch the Video)

The analysis of tensor data, i.e., arrays with multiple directions, has become an active research topic in the era of big data. Datasets in the form of tensors arise from a wide range of scientific applications. Tensor methods also provide unique perspectives to many high-dimensional problems, where the observations are not necessarily tensors. Problems in high-dimensional tensors generally possess distinct characteristics that pose great challenges to the data science community. 


In this talk, we discuss several recent advances in tensor learning and their applications in computational imaging and microbiome. We also illustrate how we develop statistically optimal methods and computationally efficient algorithms that interact with the modern theories of computation, high-dimensional statistics, and non-convex optimization.

April 20: Clarice Poon (University of Bath) 

Smooth over-parametrized solvers for non-smooth structured optimization (Watch the Video)

Non-smooth optimization is a core ingredient of many imaging or machine learning pipelines. Non-smoothness encodes structural constraints on the solutions, such as sparsity, group sparsity, low-rank and sharp edges. It is also the basis for the definition of robust loss functions such as the square-root lasso.  Standard approaches to deal with non-smoothness leverage either proximal splitting or coordinate descent. The effectiveness of their usage typically depend on proper parameter tuning, preconditioning or some sort of support pruning.  


In this work, we advocate and study a different route. By over-parameterization and marginalising on certain variables (Variable Projection), we show how many popular non-smooth structured problems can be written as smooth optimization problems. The result is that one can then take advantage of quasi-Newton solvers such as L-BFGS and this, in practice, can lead to substantial performance gains. Another interesting aspect of our proposed solver is its efficiency when handling imaging problems that arise from fine discretizations (unlike proximal methods such as Iterative Soft Thresholding/Forward Backward whose convergence is known to have exponential dependency on dimension). On a theoretical level, one can connect gradient descent on our over-parameterized formulation with mirror descent with a varying Hessian metric. This observation can then be used to derive dimension free convergence bounds and explains the efficiency of our method in the fine-grids regime.


This talk is based on joint work with Gabriel Peyre (https://arxiv.org/abs/2205.01385).

April 13: Marina Meila (University of Washington) 

Manifold Learning 2.0: Explanations and Eigenflows (Watch the Video)

Manifold learning algorithms can recover the underlying low-dimensional parametrization of high-dimensional point clouds. This talk will extend this paradigm in two directions. First, we ask if it is possible, in the case of scientific data where quantitative prior knowledge is abundant, to explain a data manifold by new coordinates, chosen from a set of scientifically meaningful functions. Second, I will show how some popular manifold learning tools and applications can be recreated in the space of vector fields and flows on a manifold. Central to this approach is the order-1 Laplacian of a manifold, $\Delta_1$, whose eigen-decomposition, known as the Helmholtz-Hodge Decomposition, provides a basis for all vector fields on a manifold. We present a consistent estimator for $\Delta_1$, which enables visualization, analysis, smoothing and semi-supervised learning of vector fields on a manifold. In topological data analysis, we describe the first-order analogue of spectral clustering, which amounts to prime manifold decomposition. Furthermore, from this decomposition a new algorithm for finding shortest independent loops follows.


Joint work with Yu-Chia Chen, Weicheng Wu, Samson Koelle, Hanyu Zhang and Ioannis Kevrekidis.

April 6: Alexander Strang (University of Chicago) 

 Strategic Feature Extraction and Low Dimensional Representation of Games (Watch the Video)

Games are widely used to test reinforcement learning paradigms, to study competitive systems in economics and biology, and to model decision tasks. Empirical game theory studies games through observation of populations of interacting agents. We introduce a generic low-dimensional embedding scheme that maps agents into a latent space which enables visualization, interpolation, and strategic feature extraction. The embedding can be used for feature extraction since it represents a generic game as a combination of simpler low-dimensional games. Through examples, we illustrate that these components may correspond to basic strategic trade-offs. We then show that the embedding scheme can represent all games with bounded payoff, or whose payoff has finite variance when two agents are sampled at random. We develop a formal approximation theory for the representation, study the stability of the embedding, provide sufficient sampling guidelines, and suggest regularizers which promote independence in the identified features.

March 30: Nicolas Keriven (CNRS, IRISA Rennes)

Entropic Optimal Transport and Wasserstein Barycenters on random graphs (Watch the Video)

In graph analysis, a classic task consists in computing similarity measures between (groups of) nodes. In latent space random graphs, nodes are associated with unknown latent variables. One may then seek to compute distances directly in the latent space, using only the graph structure. In this work, we show that it is possible to consistently estimate entropic-regularized Optimal Transport (OT) distances between groups of nodes in the latent space. We provide a general stability result for entropic OT with respect to perturbations of the cost matrix. We then apply it to several examples of random graphs, such as graphons or epsilon-graphs on manifolds. We then move on to Wasserstein barycenters, which are routinely used on manifolds that can be represented as random graphs, and examine the same stability properties. This is joint work with Marc Theveneau.
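
For reference, entropic-regularized OT distances of this kind are typically computed with Sinkhorn iterations. A minimal dense sketch (our toy marginals and cost, not the graph estimators from the talk):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.05, iters=2000):
    """Entropic-regularized OT: alternately rescale rows and columns of
    the Gibbs kernel K = exp(-C/eps) so the plan matches the marginals."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    P = u[:, None] * K * v[None, :]        # transport plan
    return P, (P * C).sum()                # plan and transport cost

# two point clouds on the line, squared-distance cost, uniform marginals
x = np.linspace(0, 1, 20)
y = np.linspace(0, 1, 30)
C = (x[:, None] - y[None, :]) ** 2
mu = np.full(20, 1 / 20)
nu = np.full(30, 1 / 30)
P, cost = sinkhorn(mu, nu, C)
```

The stability results in the abstract concern how P and the associated cost move when the cost matrix C is perturbed, e.g. when latent distances are only estimated from the graph.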

March 23: Kasso Okoudjou (Tufts University)

The HRT Conjecture: A call for a numerical approach (Watch the Video)

The two-scale relation in wavelet analysis dictates that a square-integrable function can be written as a linear combination of scaled and shifted copies of itself. This fact is equivalent to the existence of square-integrable functions whose time-scale shifts are linearly dependent. By contrast, replacing the scaling operator with a modulation operator, one would think that the linear dependency of the resulting time-frequency shifts of a square-integrable function might be easily inferred. However, more than two decades after C. Heil, J. Ramanathan, and P. Topiwala conjectured that any such finite collection of time-frequency shifts of a non-zero square-integrable function on the real line is linearly independent, this problem (the HRT Conjecture) remains unresolved.


The talk will give an overview of the HRT conjecture and introduce an inductive approach to investigate it. I will highlight a few methods that have been effective in solving the conjecture in certain special cases. However, despite the origin of the HRT conjecture in Applied and Computational Harmonic Analysis, there is a lack of experimental or numerical methods to resolve it. I will present an attempt to investigate the conjecture numerically.
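
To make the numerical angle concrete, here is one toy probe (entirely our own illustration): discretize a Gaussian, form a handful of its time-frequency shifts, and check that their Gram matrix is numerically nonsingular. For the Gaussian the conjecture is in fact settled, so this merely illustrates the kind of computation one might automate:

```python
import numpy as np

x = np.linspace(-10, 10, 4001)

def tf_shift(a, b):
    """Discretized time-frequency shift e^{2*pi*i*b*x} g(x - a) of the
    Gaussian window g(x) = exp(-pi * x^2)."""
    return np.exp(2j * np.pi * b * x) * np.exp(-np.pi * (x - a) ** 2)

# four time-frequency points, including irrational coordinates
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (np.sqrt(2), np.sqrt(3))]
V = np.column_stack([tf_shift(a, b) for a, b in points])
gram = V.conj().T @ V
sigma_min = np.linalg.svd(gram, compute_uv=False)[-1]  # > 0 <=> independent
```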

March 16: Elizabeth Newman (Emory University) 

A Matrix-Mimetic Tensor Algebra for Optimal Representations of Multiway Data (Watch the Video)

Big data has revolutionized the landscape of computational mathematics and has increased the demand for new numerical linear algebra tools to handle the vast amount of data.  One crucial task is to efficiently capture inherent structure in data using dimensionality reduction and feature extraction.  Tensor-based approaches have gained significant traction in this setting by leveraging multilinear relationships in high-dimensional data.  In this talk, we will describe a matrix-mimetic tensor algebra that offers provably optimal compressed representations of multiway data via a family of tensor singular value decompositions (SVDs).  Moreover, using the inherited linear algebra properties of this framework, we will prove that these tensor SVDs outperform the equivalent matrix SVD and two closely related tensor decompositions, the Higher-Order SVD and Tensor-Train SVD, in terms of approximation accuracy.  Throughout the talk, we will provide numerical examples to support the theory and demonstrate practical efficacy of constructing optimal tensor representations.
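
The matrix-mimetic algebra in question is built on a tensor-tensor product; for the classical t-product (one member of this family), frontal slices are multiplied in the Fourier domain along the third mode. A minimal sketch of that product (our implementation, illustrative only):

```python
import numpy as np

def t_product(A, B):
    """t-product of A (m x p x n) and B (p x q x n): transform along the
    third mode with the DFT, multiply frontal slices facewise, and
    transform back.  Matrix notions (identity, transpose, SVD) then lift
    to tensors slice-by-slice in the transform domain."""
    Ah = np.fft.fft(A, axis=2)
    Bh = np.fft.fft(B, axis=2)
    Ch = np.einsum('ipk,pqk->iqk', Ah, Bh)
    return np.real(np.fft.ifft(Ch, axis=2))

# the t-identity tensor: identity matrix in the first frontal slice
n = 5
I = np.zeros((4, 4, n))
I[:, :, 0] = np.eye(4)
A = np.random.default_rng(0).standard_normal((4, 3, n))
check = t_product(I, A)            # should reproduce A exactly
```

Truncating the per-slice SVDs in the transform domain is what yields the family of tensor SVDs whose optimality the talk describes.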


This presentation will serve as an overview of our PNAS paper "Tensor-tensor algebra for optimal representation and compression of multiway data" (https://www.pnas.org/doi/10.1073/pnas.2015851118).

March 9: Claire Boyer (Sorbonne Université)

Is interpolation benign for random forest regression? (Watch the Video)

Statistical wisdom suggests that very complex models, interpolating training data, will be poor at predicting unseen examples. Yet, this aphorism has recently been challenged by the identification of benign overfitting regimes, especially studied in the case of parametric models: generalization capabilities may be preserved despite high model complexity. While it is widely known that fully-grown decision trees interpolate and, in turn, have poor predictive performance, the same behavior is yet to be analyzed for Random Forests (RF). In this work, we study the trade-off between interpolation and consistency for several types of RF algorithms. Theoretically, we prove that interpolation regimes and consistency cannot be achieved simultaneously for several non-adaptive RF. Since adaptivity seems to be the cornerstone to bring together interpolation and consistency, we study interpolating Median RF, which are proved to be consistent in the interpolating regime. This is the first result conciliating interpolation and consistency for RF, highlighting that the averaging effect introduced by feature randomization is a key mechanism, sufficient to ensure consistency in the interpolation regime and beyond. Numerical experiments show that Breiman's RF are consistent while exactly interpolating when no bootstrap step is involved. We theoretically control the size of the interpolation area, which converges fast enough to zero, giving a necessary condition for exact interpolation and consistency to occur in conjunction.

March 2: Yuejie Chi (Carnegie Mellon University)

The Non-asymptotics of Reinforcement Learning (Watch the Video)

Reinforcement learning (RL) has garnered significant interest in recent years due to its success in a wide variety of modern applications. However, theoretical understanding of the non-asymptotic sample and computational efficiency of RL algorithms remains elusive and is in imminent need to cope with ever-increasing problem dimensions. In this talk, we discuss our recent progress that sheds light on the efficacy of popular RL algorithms in finding the optimal policy in tabular Markov decision processes.

February 23: Yufeng Liu (University of North Carolina at Chapel Hill) 

Learning Individualized Treatment Rules with Many Treatments (Watch the Video)

Learning an optimal Individualized Treatment Rule (ITR) is a very important problem in precision medicine. In this talk, we consider the challenge when the number of treatment arms is large, and some groups of treatments in the large treatment space may work similarly for the patients. Motivated by the recent development of supervised clustering, we propose a novel adaptive fusion-based method to cluster the treatments with similar treatment effects together and estimate the optimal ITR simultaneously through a single convex optimization. We establish the theoretical guarantee of recovering the underlying true clustering structure of the treatments for our method. Finally, the superior performance of our method will be demonstrated via both simulations and a real data application on cancer treatment.


This is joint work with Haixu Ma and Donglin Zeng at UNC-Chapel Hill.

February 16: Nicolas Gillis (Université de Mons) 

Nonnegative Matrix Factorization: Introduction, Identifiability and Beyond (Watch the Video)

Given a nonnegative matrix X and a factorization rank r, nonnegative matrix factorization (NMF) approximates the matrix X as the product of a nonnegative matrix W with r columns and a nonnegative matrix H with r rows. NMF has become a standard linear dimensionality reduction technique in data mining and machine learning. In this talk, we first introduce NMF and show how it can be used as an interpretable unsupervised data analysis tool in various applications, including hyperspectral image unmixing, image feature extraction, and document classification. Then, we discuss the issue of non-uniqueness of NMF decompositions, also known as the identifiability issue, which is crucial in many applications. Finally, we discuss how we can go beyond NMF by considering non-linear and deep extensions which are useful in real-world applications and offer many avenues for future research.
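
For concreteness, the classical Lee–Seung multiplicative updates give a minimal NMF solver (a sketch with our toy sizes; practical codes add stopping rules and better initialization):

```python
import numpy as np

def nmf(X, r, iters=1000, seed=0):
    """Approximate X (n x m, nonnegative) as W @ H with W (n x r) and
    H (r x m) nonnegative, via Lee-Seung multiplicative updates for the
    Frobenius-norm objective ||X - W H||_F^2."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, r))
    H = rng.random((r, m))
    eps = 1e-10                        # avoids division by zero
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
X = rng.random((50, 3)) @ rng.random((3, 40))   # exactly rank 3, nonnegative
W, H = nmf(X, r=3)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Note that even an exact fit here need not recover a unique (W, H): this is precisely the identifiability issue the abstract highlights.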

February 9: Elizabeth Munch (Michigan State University) 

Combining network analysis and persistent homology for classifying behavior of time series (Watch the Video) 

Persistent homology, the flagship method of topological data analysis, can be used to provide a quantitative summary of the shape of data.  One way to pass data to this method is to start with a finite, discrete metric space (whether or not it arises from a Euclidean embedding) and to study the resulting filtration of the Rips complex.  In this talk, we will discuss several available methods for turning a time series into a discrete metric space, including the Takens embedding, k-nearest neighbor networks, and ordinal partition networks.  Combined with persistent homology and machine learning methods, we show how this can be used to classify behavior in time series in both synthetic and experimental data. 
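
Of the constructions mentioned, the Takens (delay-coordinate) embedding is the simplest to state: a scalar series becomes a point cloud of lagged vectors. A small sketch (our toy sine wave; a real pipeline would then feed the cloud to a persistence computation):

```python
import numpy as np

def takens_embedding(ts, dim, tau):
    """Delay embedding: map x_t to (x_t, x_{t+tau}, ..., x_{t+(dim-1)tau}),
    turning a scalar time series into a point cloud in R^dim."""
    n = len(ts) - (dim - 1) * tau
    return np.column_stack([ts[i * tau: i * tau + n] for i in range(dim)])

# a periodic signal with ~200 samples per period
t = np.linspace(0, 20 * np.pi, 2000)
cloud = takens_embedding(np.sin(t), dim=2, tau=50)   # quarter-period delay
# (sin, cos)-like coordinates: the cloud traces a loop, the kind of
# feature that 1-dimensional persistent homology detects
radii = np.linalg.norm(cloud, axis=1)
```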


February 2: James Murphy (Tufts University) 

  Towards Intrinsically Low-Dimensional Models in Wasserstein Space: Geometry, Statistics, and Learning (Watch the Video)

We consider the problems of efficient modeling and representation learning for probability distributions in Wasserstein space.  We consider a general barycentric coding model in which data are represented as Wasserstein-2 (W2) barycenters of a set of fixed reference measures.  Leveraging the Riemannian structure of W2-space, we develop a tractable optimization program to learn the barycentric coordinates when given access to the densities of the underlying measures.  We provide a consistent statistical procedure for learning these coordinates when the measures are accessed only by i.i.d. samples.  Our consistency results and algorithms exploit entropic regularization of the optimal transport problem, thereby allowing our barycentric modeling approach to scale efficiently.  We also consider the problem of learning reference measures given observed data.  Our regularized approach to dictionary learning in Wasserstein space addresses core problems of ill-posedness and in practice learns interpretable dictionary elements and coefficients useful for downstream tasks.  Applications to image and natural language processing will be shown throughout the talk.

January 19: Madeleine Udell (Stanford University) 

Low rank approximation for faster optimization (Watch the Video)

Low rank structure is pervasive in real-world datasets.  This talk shows how to accelerate the solution of fundamental computational problems, including eigenvalue decomposition, linear system solves, composite convex optimization, and stochastic optimization (including deep learning), by exploiting this low rank structure.  We present a simple method based on randomized numerical linear algebra for efficiently computing approximate top eigendecompositions, which can be used to replace large matrices (such as Hessians and constraint matrices) with low rank surrogates that are faster to apply and invert. The resulting solvers for linear systems (NystromPCG), composite convex optimization (NysADMM), and deep learning (SketchySGD) demonstrate strong theoretical and numerical support, outperforming state-of-the-art methods in terms of speed and robustness to hyperparameters.
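
As one concrete instance of the randomized linear algebra primitive underlying such solvers, a single-pass Nyström approximation replaces a symmetric psd matrix with a low-rank factor built from one sketch (a simplified version of standard constructions; the jitter term is our numerical safeguard):

```python
import numpy as np

def randomized_nystrom(A, k, p=10, seed=0):
    """Randomized Nystrom approximation of a symmetric psd matrix A:
    sketch Y = A @ Omega with a Gaussian test matrix Omega and form the
    rank-(k+p) approximation Y (Omega^T Y)^{-1} Y^T = B B^T."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Omega = rng.standard_normal((n, k + p))
    Y = A @ Omega
    # Cholesky of the small (k+p) x (k+p) core, with a tiny jitter for stability
    C = np.linalg.cholesky(Omega.T @ Y + 1e-10 * np.eye(k + p))
    B = np.linalg.solve(C, Y.T).T      # B @ B.T = Y (Omega^T Y)^{-1} Y^T
    return B

# psd matrix with rapid spectral decay, like many Hessians in practice
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((200, 200)))
A = Q @ np.diag(0.5 ** np.arange(200)) @ Q.T
B = randomized_nystrom(A, k=20)
err = np.linalg.norm(A - B @ B.T) / np.linalg.norm(A)
```

The thin factor B is then cheap to apply and (via the Woodbury identity) to invert, which is how such surrogates speed up preconditioning and second-order updates.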

January 12: Simon Foucart (Texas A&M University)

Three uses of semidefinite programming in approximation theory

In this talk, modern optimization techniques are publicized as fitting computational tools to attack several extremal problems from Approximation Theory which had reached their limitations based on purely analytical approaches. Three such problems are showcased: the first problem — minimal projections — involves minimization over measures and exploits the moment method; the second problem — constrained approximation — involves minimization over polynomials and exploits the sum-of-squares method; and the third problem — optimal recovery from inaccurate observations — is highly relevant in Data Science and exploits the S-procedure. In each of these problems, one ends up having to solve semidefinite programs. 

December 15: Guido Montufar (UCLA/Max Planck Institute)

 The Geometry of Memoryless Stochastic Policy Optimization in Infinite-Horizon POMDPs (Watch the Video)

We consider the problem of finding the best memoryless stochastic policy for an infinite-horizon partially observable Markov decision process (POMDP) with finite state and action spaces with respect to either the discounted or mean reward criterion. We show that the (discounted) state-action frequencies and the expected cumulative reward are rational functions of the policy, whereby the degree is determined by the degree of partial observability. We then describe the optimization problem as a linear optimization problem in the space of feasible state-action frequencies subject to polynomial constraints that we characterize explicitly. This allows us to address the combinatorial and geometric complexity of the optimization problem using recent tools from polynomial optimization. In particular, we estimate the number of critical points and use the polynomial programming description of reward maximization to solve a navigation problem in a grid world. This is work with Johannes Müller. 


December 8: Rongjie Lai (Rensselaer Polytechnic Institute)

Computational Methods for Mean-field Games: from conventional numerical methods to deep generative models (Watch the Video)

Mean field game (MFG) problems study how a large number of similar rational agents make strategic movements to minimize their costs. In this talk, I will discuss our recent work on computational methods for MFGs. I will start from a low-dimensional setting using conventional discretization/optimization methods and discuss some convergence results of the proposed method. After that, I will extend my discussion to high-dimensional problems by bridging the trajectory representation of MFG with a special type of deep generative model, normalizing flows. This connection does not only help solve high-dimensional MFGs, but also provides a way to improve the robustness of normalizing flows. If time permits, I will discuss its extension to computing problems on low-dimensional manifolds. 

December 1: Zhaoran Wang (Northwestern University)

Demystifying (Deep) Reinforcement Learning with Optimism and Pessimism (Watch the Video)

Coupled with powerful function approximators such as deep neural networks, reinforcement learning (RL) achieves tremendous empirical successes. However, its theoretical understanding lags behind. In particular, it remains unclear how to provably attain the optimal policy with finite regret or sample complexity. In this talk, we will present two sides of the same coin, demonstrating an intriguing duality between optimism and pessimism.

– In the online setting, we aim to learn the optimal policy by actively interacting with the environment. To strike a balance between exploration and exploitation, we propose an optimistic least-squares value iteration algorithm, which achieves a \sqrt{T} regret in the presence of linear, kernel, and neural function approximators.

– In the offline setting, we aim to learn the optimal policy based on a dataset collected a priori. Due to a lack of active interactions with the environment, we suffer from the insufficient coverage of the dataset. To maximally exploit the dataset, we propose a pessimistic least-squares value iteration algorithm, which achieves a minimax-optimal sample complexity.

November 24: Hanna (Anna) Veselovska (TU Munich) 

Digital Halftoning via Weighted Sigma-Delta Quantization (Watch the Video)

In this talk, we consider error diffusion techniques for digital halftoning from the perspective of 1-bit Sigma-Delta quantization. We introduce a method to generate Sigma-Delta schemes for two-dimensional signals as weighted combinations of their one-dimensional counterparts and show that various error diffusion schemes proposed in the literature can be represented in this framework via Sigma-Delta schemes of first order. Under the model of two-dimensional bandlimited signals, which is motivated by a mathematical model of human visual perception, we discuss quantitative error bounds for such weighted Sigma-Delta schemes.

Motivated by the correspondence between existing error diffusion algorithms and first-order Sigma-Delta schemes, we study the performance of the analogous weighted combinations of second-order Sigma-Delta schemes and show that they exhibit superior performance in terms of guaranteed error decay for two-dimensional bandlimited signals. In numerical simulations for real-world images, we demonstrate that with some modifications to enhance stability this high-quality performance also translates to the problem of digital halftoning.
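The one-dimensional building block behind these schemes is easy to state; here is a minimal first-order 1-bit Sigma-Delta quantizer (an illustrative sketch only, not the weighted two-dimensional construction of the talk):

```python
import numpy as np

def sigma_delta_1bit(x):
    """First-order 1-bit Sigma-Delta quantization of a signal x with
    values in [-1, 1]. The internal state u accumulates the running
    quantization error and stays bounded (|u| <= 1), so local averages
    of the bit stream q track local averages of x."""
    u = 0.0
    q = np.empty(len(x))
    for i, xi in enumerate(x):
        q[i] = 1.0 if u + xi >= 0 else -1.0   # greedy 1-bit quantizer
        u = u + xi - q[i]                     # error update: u_i = u_{i-1} + x_i - q_i
    return q
```

Because the state stays bounded, the average of any window of N bits deviates from the average of the input by at most O(1/N), which is the error-decay mechanism the talk refines to higher orders and to two dimensions.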


This is joint work with Felix Krahmer.

November 17: Wei Zhu (University of Massachusetts Amherst)

Symmetry-preserving machine learning for computer vision, scientific computing, and distribution learning  (Watch the Video)

Symmetry is ubiquitous in machine learning and scientific computing. Robust incorporation of symmetry priors into the learning process has been shown to achieve significant model improvement for various learning tasks, especially in the small data regime. In the first part of the talk, I will explain a principled framework of deformation-robust symmetry-preserving machine learning. The key idea is the spectral regularization of the (group) convolutional filters, which ensures that symmetry is robustly preserved in the model even if the symmetry transformation is “contaminated” by nuisance data deformation. In the second part of the talk, I will demonstrate how to incorporate additional structural information (such as group symmetry) into generative adversarial networks (GANs) for data-efficient distribution learning. This is accomplished by developing new variational representations for divergences between probability measures with embedded structures. We study, both theoretically and empirically, the effect of structural priors in the two GAN players. The resulting structure-preserving GAN is able to achieve significantly improved sample fidelity and diversity (almost an order of magnitude, measured in Fréchet Inception Distance), especially in the limited data regime.


November 10: Sebastian Goldt (SISSA Trieste)

On the importance of higher-order input statistics for learning in neural networks (Watch the Video)

Neural networks are powerful feature extractors - but which features do they extract from their data? And how does the structure of the training data shape the representations they learn? We discuss two recent works in this direction. We first develop a simple model for images and show that a neural network trained on these images can learn a convolution from scratch [1]. This pattern-formation process is driven by a combination of the translation-invariance of the "images" and the non-Gaussian, higher-order statistics of the inputs. Second, we conjecture a "distributional simplicity bias" whereby neural networks learn increasingly complex distributions of their inputs during training. We present analytical and experimental evidence for this conjecture, going from a simple perceptron up to deep ResNets [2].


References:

[1] Ingrosso & Goldt, PNAS 119 (40), e2201854119; arXiv:2202.00565.

[2] Refinetti, Ingrosso & Goldt, in preparation.

November 3, 2022: Alex Townsend (Cornell University)

Learning Green's functions associated with elliptic and parabolic PDEs (Watch the Video)

Can one learn a differential operator from pairs of solutions and right-hand sides? If so, how many pairs are required? These two questions have received significant research attention in partial differential equation (PDE) learning. Given input-output pairs from an unknown elliptic or parabolic PDE, we will derive a theoretically rigorous scheme for learning the associated Green's function. By exploiting the hierarchical low-rank structure of Green’s functions and randomized linear algebra, we will have a provable learning rate. Along the way, we will develop essential new Green's function theory associated with parabolic PDEs and a more general theory for the randomized singular value decomposition.
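The randomized singular value decomposition at the heart of this learning rate can be sketched in a few lines (a generic Halko-Martinsson-Tropp-style prototype with illustrative parameter choices, not the generalized theory developed in the talk):

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, power_iters=2):
    """Prototype randomized SVD: sketch the range of A with a Gaussian
    test matrix, sharpen it with power iterations, then compute an
    exact SVD of the small projected matrix."""
    Omega = np.random.randn(A.shape[1], rank + oversample)
    Y = A @ Omega
    for _ in range(power_iters):          # power iterations improve accuracy
        Y = A @ (A.T @ Y)                 # when singular values decay slowly
    Q, _ = np.linalg.qr(Y)                # orthonormal basis for the range estimate
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ Ub[:, :rank], s[:rank], Vt[:rank]
```

Applied hierarchically to admissible sub-blocks of the solution operator, this is the kind of primitive that lets one learn a Green's function from relatively few input-output pairs.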

October 27, 2022: Guanghui (George) Lan (Georgia Institute of Technology)

Policy mirror descent for online reinforcement learning (Watch the Video)

Reinforcement Learning (RL) has attracted considerable interest from both industry and academia during the past few years. The study of RL algorithms with provable rates of convergence, however, is still in its infancy. In this talk, we discuss some recent progress that bridges RL with stochastic nonlinear programming. We pay special attention to online reinforcement learning, which aims to continuously improve system performance in situ as better and better policies are discovered and deployed. More specifically, we introduce a new and general class of policy mirror descent (PMD) methods and show that they achieve linear convergence for the deterministic case and optimal sampling complexity for the stochastic case for discounted Markov decision processes. We also show how the gradient information can be estimated efficiently online through a few recently proposed conditional temporal difference methods. Extensions of these algorithms for the average reward and block coordinate settings will also be discussed.

October 20: Pierre Weiss (CNRS/University of Toulouse)

Blind deblurring in microscopy (Watch the Video)

An image acquired with a microscope gets filtered and sampled. Although filtering is unavoidable, the loss of resolution is arguably one of the main obstacles to understanding the living world. This is evidenced by two Nobel prizes awarded to researchers who proposed systems overcoming the diffraction limit in the last decade.

The question raised in this presentation is: can we reconstitute part of the lost information when the response of the optical system is not perfectly known? We will present some theoretical and practical aspects of this problem. We will start by proposing precise forward models, then turn to rather well understood handcrafted methods and finish by deep learning based methods.
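For the non-blind case, where the point spread function (PSF) is known, the classical handcrafted baseline is Wiener deconvolution; here is a minimal 1-D FFT sketch (illustrative only: real microscopy pipelines work in 2-D or 3-D and must also estimate the PSF, which is the blind problem of the talk):

```python
import numpy as np

def wiener_deconvolve(y, psf, eps=1e-6):
    """Wiener-style deconvolution of a circularly blurred signal y,
    given the point spread function psf (same length as y).
    eps regularizes frequencies where the PSF response is small."""
    H = np.fft.fft(psf)
    Y = np.fft.fft(y)
    X_hat = Y * np.conj(H) / (np.abs(H) ** 2 + eps)   # regularized inverse filter
    return np.fft.ifft(X_hat).real
```

The regularizer eps is exactly where prior knowledge enters: the frequencies the optical system suppresses cannot be recovered without a model of the signal, which is what the handcrafted and learned methods of the talk supply.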

October 13, 2022: Dustin Mixon (The Ohio State University)  

Group-invariant max filtering  (Watch the Video)

October 6, 2022: Weijie Su (University of Pennsylvania)  

What Should a Good Deep Neural Network Look Like? Insights from a Layer-Peeled Model and the Law of Equi-Separation (Watch the Video)

In this talk, we will investigate the emergence of geometric patterns in well-trained deep learning models by making use of a layer-peeled model and the law of equi-separation. The former is a nonconvex optimization program that models the last-layer features and weights. We use the model to shed light on the neural collapse phenomenon of Papyan, Han, and Donoho, and to predict a hitherto-unknown phenomenon that we term minority collapse in imbalanced training. 

The law of equi-separation is a pervasive empirical phenomenon that describes how data are separated according to their class membership from the bottom to the top layer in a well-trained neural network. We will show, through extensive computational experiments, that neural networks improve data separation through layers at a simple exponential rate. This law leads to roughly equal ratios of improvement in separation per layer, thereby showing that all layers are created equal. We will conclude the talk by discussing the implications of this law on the interpretation, robustness, and generalization of deep learning, as well as on the inadequacy of some existing approaches toward demystifying deep learning. This is based on joint work with Hangfeng He.

September 29, 2022: Shuyang Ling (NYU Shanghai)

Near-Optimal Bounds for Generalized Orthogonal Procrustes Problem via Generalized Power Method (Watch the Video)

Given multiple point clouds, how can one find the rigid transform (rotation, reflection, and shifting) such that these point clouds are well aligned? This problem, known as the generalized orthogonal Procrustes problem (GOPP), has found numerous applications in statistics, computer vision, and imaging science. While one commonly used method is finding the least squares estimator, it is generally NP-hard to obtain the least squares estimator exactly due to the notorious nonconvexity. In this work, we apply the semidefinite programming (SDP) relaxation and the generalized power method to solve this generalized orthogonal Procrustes problem. In particular, we assume the data are generated from a signal-plus-noise model: each observed point cloud is a noisy copy of the same unknown point cloud transformed by an unknown orthogonal matrix and also corrupted by additive Gaussian noise. We show that the generalized power method (equivalently, an alternating minimization algorithm) with spectral initialization converges to the unique global optimum of the SDP relaxation, provided that the signal-to-noise ratio is high. Moreover, this limiting point is exactly the least squares estimator and also the maximum likelihood estimator. In addition, we derive a block-wise estimation error for each orthogonal matrix and the underlying point cloud. Our theoretical bound is near-optimal in terms of the information-theoretic limit (only loose by a factor of the dimension and a log factor). Our results significantly improve the state-of-the-art results on the tightness of the SDP relaxation for the generalized orthogonal Procrustes problem, an open problem posed by Bandeira, Khoo, and Singer in 2014.
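The classical two-cloud building block has a closed-form solution via the SVD; a minimal sketch (the generalized, multi-cloud and noisy problem of the talk is what requires the SDP relaxation and the generalized power method):

```python
import numpy as np

def procrustes_align(X, Y):
    """Least-squares orthogonal alignment of point clouds X, Y of shape
    (n_points, d): the R minimizing ||X R - Y||_F over orthogonal
    matrices is U V^T, where X^T Y = U S V^T is an SVD."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```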

September 22, 2022: Rayan Saab (University of California, San Diego)  

Algorithms for quantizing neural networks (Watch the Video)

Neural networks are highly non-linear functions often parametrized by a staggering number of weights. Miniaturizing these networks and implementing them in hardware is a direction of research that is fueled by a practical need, and at the same time connects to interesting mathematical problems. For example, by quantizing, or replacing the weights of a neural network with quantized (e.g., binary) counterparts, massive savings in cost, computation time, memory, and power consumption can be attained. Of course, one wishes to attain these savings while preserving the action of the function on domains of interest. 

We present data-driven and computationally efficient methods for quantizing the weights of already trained neural networks and  we prove that our methods have favorable error guarantees under a variety of assumptions. We also discuss extensions and provide the results of numerical experiments, on large multi-layer networks, to illustrate the  performance of our methods. Time permitting, we will also discuss open problems and related areas of research.
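To give the flavor of a data-driven quantizer, here is a greedy per-neuron sketch in the spirit of such methods (my own hedged illustration, not necessarily the algorithm of the talk): each weight is snapped to a grid so as to cancel, as far as possible, the output error the previously quantized weights have accumulated on the data.

```python
import numpy as np

def greedy_quantize(w, X, delta=0.05):
    """Greedily quantize the weights w of a single neuron onto the grid
    delta*Z, using the feature columns X[:, i] to steer each choice so
    that the accumulated output error u = X @ (w_prefix - q_prefix)
    stays small."""
    q = np.zeros_like(w)
    u = np.zeros(X.shape[0])                   # running output-space error
    for i in range(len(w)):
        Xi = X[:, i]
        c = Xi @ (u + w[i] * Xi) / (Xi @ Xi)   # best real coefficient for this step
        q[i] = delta * np.round(c / delta)     # snap to the quantization grid
        u = u + (w[i] - q[i]) * Xi
    return q
```

The point of steering by the data, rather than rounding each weight independently, is that the accumulated error u behaves like the bounded state of a Sigma-Delta quantizer, which is what makes error guarantees possible.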

September 15, 2022:  Soledad Villar (Johns Hopkins University)  

Equivariant machine learning from invariant functions (Watch the Video)

In equivariant machine learning the idea is to restrict the learning to a hypothesis class where all the functions are equivariant with respect to some group action. In this talk we describe how irreducible representations and invariant theory are typically used to parameterize the space of such functions. We also explain a general procedure, attributed to Malgrange, to express all polynomial (or all smooth) maps between linear spaces that are equivariant with respect to the action of a compact Lie group, given a characterization of the invariant polynomials on a bigger space.

September 8, 2022:  Kristian Bredies (University of Graz)

Dynamic optimal transport for sparse dynamic superresolution and beyond (Watch the Video)

joint work with Marcello Carioni, Silvio Fanzon and Francisco Romero

We discuss the solution of sparse dynamic superresolution problems in the context of general dynamic inverse problems. Here, for each time point, the problem is to invert a time-dependent linear forward operator mapping the space of measures to a time-dependent Hilbert space for given, possibly noisy data. Such problems are regularized with dynamic optimal-transport energies that base on the continuity equation as well as convex functionals of Benamou-Brenier-type. The functional-analytic framework as well as well-posedness of associated Tikhonov minimization problems are discussed in detail. In particular, we discuss how sparse solutions, i.e., a finite number of moving point masses, can be enforced via the proposed dynamic optimal-transport regularizers. This is closely connected with sparsity results for general inverse problems and the extremal points of the Benamou-Brenier energy subject to the continuity equation. For the latter, it is proven that the extremal points are indeed realized by point masses moving along curves with Sobolev regularity. This result will be employed in numerical optimization algorithms of generalized conditional gradient or Frank-Wolfe type. We present instances of this algorithm that are tailored towards dynamic inverse problems associated with point tracking. In particular, such instances allow for an application to dynamic superresolution, i.e., the recovery from finitely many Fourier coefficients. For the latter, numerical performance of the method is demonstrated for prototypical examples.

Second Year of Talks Below:  July 2021 - June 2022

Organizers (July 2021 - June 2022):  Matthew Hirn (Principal Organizer, Michigan State University), Mark Iwen (Michigan State University), Felix Krahmer (Technische Universität München), Shuyang Ling (New York University Shanghai), Rayan Saab, (University of California, San Diego), Karin Schnass (University of Innsbruck), and Soledad Villar (Johns Hopkins University) 

June 9, 2022:  Fanny Yang (ETH Zurich)

Fast rates for noisy interpolation require rethinking the effects of inductive bias (Watch the Video)

Modern machine learning has uncovered an interesting observation: large overparameterized models can achieve good generalization performance despite interpolating noisy training data. In this talk, we study high-dimensional linear models and show how interpolators can achieve fast statistical rates when their structural bias is moderate. More concretely, while minimum-l2-norm interpolators cannot recover the signal in high dimensions, minimum-l1-norm interpolators with strong sparsity bias are much more sensitive to noise. In fact, we show that even though they are asymptotically consistent, minimum-l1-norm interpolators converge with a logarithmic rate much slower than the O(1/n) rate of regularized estimators. In contrast, minimum-lp-norm interpolators with 1<p<2 can trade off these two competing trends to yield polynomial rates close to O(1/n).
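The two interpolators contrasted above are easy to compute in the noiseless underdetermined linear model (a hedged sketch of their computation only; the talk concerns their statistical rates):

```python
import numpy as np
from scipy.optimize import linprog

def min_l2_interpolator(X, y):
    """Minimum-l2-norm solution of the underdetermined system X w = y."""
    return np.linalg.pinv(X) @ y

def min_l1_interpolator(X, y):
    """Minimum-l1-norm interpolator via the standard LP reformulation:
    minimize sum(t) subject to -t <= w <= t and X w = y."""
    m, n = X.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])          # objective: sum of t
    A_ub = np.block([[np.eye(n), -np.eye(n)],
                     [-np.eye(n), -np.eye(n)]])            # w <= t and -w <= t
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n),
                  A_eq=np.hstack([X, np.zeros((m, n))]), b_eq=y,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    return res.x[:n]
```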

June 2:  Florian Boßmann  (Harbin Institute of Technology)

Beyond SVD - two new adaptive models for moving objects (Watch the Video)

Many data processing techniques treat the given data as a union of its coefficients. Sparse transformations like Wavelets, Curvelets, Shearlets, etc. transform the data from one coefficient space into another. Inverse transforms used in MRI, CT, and inverse scattering map the measured data back onto a pixel image. Only in a second step is the obtained result analyzed to extract the desired information. Although the above-mentioned methods perform excellently for their designed task, the overall process is still a two-step method. Moreover, the intermediate step often seems to artificially increase the problem size. Do we really need to solve the complete inverse problem when, in the end, only a small part contains the information of interest? For example, do we need to reconstruct the full velocity field of seismic waves when we are only interested in detecting subsurface material boundaries?

We think that instead of viewing data as a union of its coefficients, we should see data as a union of its information. As the amount of information required is often much smaller than the data size, this already gives implicit sparsity. In many cases the information is directly bound to an object contained in the data. For example, each car in a video recorded by a traffic camera carries information about the traffic status. A seismic wave in geophysical data carries information about the subsurface conditions. We want to give those physical objects a mathematical model, such that the data can be mapped into the model space where the information can directly be extracted. Some techniques already use this, or a similar, approach. For example, a neural network can be trained as a classifier to directly extract information out of the data. This technique requires a lot of training data and is not feasible for all applications. Singular value decomposition (SVD) or Principal Component Analysis (PCA) can also be interpreted as such an object-oriented method. Applying an SVD to video data will return the video background as the largest singular vector, as long as the camera is not moving. However, the SVD struggles whenever objects in the video are moving. We present two extensions of SVD that are designed to recover moving objects in the data (not only in video data, but also in other applications). The first approach is the so-called shifted rank-1 model, which allows object movement. The second approach - ORKA - extends this model by allowing the objects to also change form and restricting their movement to a smooth path. (Joint work with Jianwei Ma.)
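The static-camera SVD baseline described above fits in a few lines (a minimal sketch with frames stacked as columns of a matrix; the shifted rank-1 and ORKA models of the talk are designed for exactly the moving-object case where this baseline breaks down):

```python
import numpy as np

def rank1_background(M):
    """Static-background estimate for a video whose frames are the
    columns of M: the best rank-1 approximation in the Frobenius norm,
    given by the leading singular triple."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    background = s[0] * np.outer(U[:, 0], Vt[0])
    return background, M - background     # background per frame, foreground residual
```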

References:

[1] F. Boßmann, J. Ma, Enhanced image approximation using shifted rank-1 reconstruction. Inverse Problems and Imaging, 14 (2), 267-290, 2020.

[2] F. Boßmann, J. Ma, ORKA: Object reconstruction using a K-approximation graph, submitted, available on arXiv, 2022.

May 26:  Selin Aviyente (Michigan State University)

Multiview Graph Learning (Watch the Video)

Modern data analysis involves large sets of structured data, where the structure carries critical information about the nature of the data. These relationships between entities, such as features or data samples, are usually described by a graph structure. While many real-world data are intrinsically graph-structured, e.g. social and traffic networks, there is still a large number of applications, where the graph topology is not readily available. For instance, gene regulations in biological applications or neuronal connections in the brain are not known. In these applications, the graphs need to be learned since they reveal the relational structure and may assist in a variety of learning tasks. Graph learning (GL) deals with the inference of a topological structure among entities from a set of observations on these entities, i.e., graph signals. Most of the existing work on graph learning focuses on learning a single graph structure, assuming that the relations between the observed data samples are homogeneous. However, in many real-world applications, there are different forms of interactions between data samples, such as single-cell RNA sequencing (scRNA-seq) across multiple cell types. This talk will present a new framework for multiview graph learning in two settings: i) multiple views of the same data and ii) heterogeneous data with unknown cluster information. In the first case, a joint learning approach where both individual graphs and a consensus graph are learned will be developed. In the second case, a unified framework that merges classical spectral clustering with graph signal smoothness will be developed for joint clustering and multiview graph learning. 

This is joint work with Abdullah Karaaslanli, Satabdi Saha and Taps Maiti.

May 12:  Pablo Groisman (Universidad de Buenos Aires)

Learning distances to learn topologies, to learn dynamical systems, to learn from chaos (Watch the Video)

Consider a finite sample of points on a manifold embedded in Euclidean space. We'll address the following issues.

1. How we can infer, from the sample, intrinsic distances that are meaningful to understand the data.

2. How we can use these estimated distances to infer the geometry and topology of the manifold (manifold learning).

3. How we can use this knowledge to validate dynamical systems models for chaotic phenomena.

4. Time permitting, we will show applications of this machinery to understand data from the production of songs in canaries.

We will prove that if the sample is given by iid points with density f supported on the manifold, the metric space defined by the sample endowed with a computable metric known as sample Fermat distance converges in the sense of Gromov–Hausdorff. The limiting object is the manifold itself endowed with the population Fermat distance, an intrinsic metric that accounts for both the geometry of the manifold and the density that produces the sample. Then we'll apply this result to estimate the topology of the manifold by constructing intrinsic persistence diagrams (an estimator of the homology of the manifold), which are less sensitive to the particular embedding of the manifold in the Euclidean space and to outliers. We'll also discuss how to use these tools to validate (or to refute) models for chaotic dynamical systems, with applications to the study of birdsongs.
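A sample version of the Fermat distance can be computed with an off-the-shelf shortest-path solver, using Euclidean edge costs raised to a power alpha >= 1 (a minimal sketch of the estimator; the talk's results concern its Gromov-Hausdorff convergence and the resulting persistence diagrams):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def sample_fermat_distance(X, alpha=2.0):
    """All-pairs sample Fermat distance: the cheapest path through the
    sample points, where a hop from x to y costs ||x - y||^alpha.
    For alpha > 1, long hops are penalized, so paths prefer to travel
    through dense regions of the sample."""
    W = cdist(X, X) ** alpha
    return shortest_path(W, method='D', directed=False)
```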

May 5:  Guillaume Lecué (Center for Research in Economics and Statistics)

Robust mean estimation via median-of-means: algorithms and minimax rates (Watch the Video)

Robust mean estimation has witnessed increasing interest during the last years in both statistics and computer science. After a review of the literature in this domain, we will talk about three complementary works [1,2,3].

First, in [1], we present the construction of an algorithm, running in linear time, which is robust to outliers and heavy-tailed data and which achieves the subgaussian rate with respect to the canonical Euclidean norm. The algorithm is fully data-driven and does not use in its construction the proportion of outliers nor the optimal subgaussian rate. Its construction combines tools from Median-of-Means estimators and covering semidefinite programming [4,5]. We also show that this algorithm can automatically adapt to the number of outliers.

Then in [2], we consider the problem of robust mean and location estimation with respect to any pseudo-norm. We  first obtain the deviation-optimal minimax subgaussian rate for that problem for Gaussian data. This improves the entropic minimax lower bound from [6] and closes the gap characterized by Sudakov's inequality between the entropy and the Gaussian mean width for this problem. This shows that the right statistical complexity measure for the mean estimation problem with respect to any norm is the Gaussian mean width. 

We also show that this rate can be achieved by a solution to a convex optimization problem in the adversarial corruption model and L2 heavy-tailed setup by considering a minimum of some Fenchel-Legendre transforms constructed using the median-of-means principle. We finally show that this rate may also be achieved in situations where there is not even a first moment but a location parameter exists.

In [3], we show that similar results can be obtained when the norm ||Sigma^{-1/2} · ||_2 (leading to confidence sets with smallest volume) is used for the estimation of the mean even though the covariance matrix Sigma is unknown. To that end, we consider median-of-means (MOM) versions of the Stahel-Donoho outlyingness (SDO) [7,8] and of the Median Absolute Deviation (MAD) [9] functions to construct subgaussian estimators of a mean vector under adversarial contamination and heavy-tailed data. We develop a single analysis of the MOM version of the SDO which covers all cases ranging from the Gaussian case to the L_2 case. It is based on isomorphic and almost isometric properties of the MOM versions of SDO and MAD. This analysis also covers cases where the mean does not even exist but a location parameter does; in those cases, we still recover the same subgaussian rates and the same price for adversarial contamination even though there is not even a first moment. In the Gaussian case and for more general spherically invariant distributions, these properties are achieved by the classical SDO median and are therefore the first non-asymptotic statistical bounds on the Stahel-Donoho median complementing the consistency [10] and asymptotic normality [11] of the Stahel-Donoho estimators. We also show that the MOM version of MAD can be used to construct an estimator of the covariance matrix under only the existence of a second moment or of a scatter matrix if a second moment does not exist.
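The basic median-of-means principle underlying all three works fits in a few lines (a univariate sketch only; the estimators of [1]-[3] are multivariate and considerably more sophisticated):

```python
import numpy as np

def median_of_means(x, n_blocks, seed=None):
    """Median-of-means estimate of the mean of x: split the shuffled
    sample into blocks, average each block, and take the median of the
    block means. A few corrupted or heavy-tailed points can spoil only
    a few block means, and the median ignores them."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(x), n_blocks)
    return np.median([b.mean() for b in blocks])
```

Choosing the number of blocks trades robustness (more blocks tolerate more outliers) against the variance of the block means, which is the tuning that the subgaussian-rate analyses make precise.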

References:

[1] J. Depersin and G. Lecué. Robust subgaussian estimator of a mean vector in nearly linear time.

The Annals of Statistics, volume 50, Number 1, (2022), 511-536.

[2] J. Depersin and G. Lecué. Optimal robust mean and location estimation via convex programs with respect to any pseudo-norms

To appear in Probability theory and related fields.

[3] J. Depersin and G. Lecué. On the robustness to adversarial corruption and to heavy-tailed data of the Stahel-Donoho median of means

Submitted, 2021.

[4] Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2755–2771. SIAM, Philadelphia, PA, 2019.

[5] R. Peng, K. Tangwongsan, and P. Zhang. Faster and simpler width-independent parallel algorithms for positive semidefinite programming, 2012.

[6] G. Lugosi and S. Mendelson. Near-optimal mean estimators with respect to general norms. Probab. Theory Related Fields, 175(3-4):957--973, 2019.

[7] W. Stahel. Robuste schätzungen: infinitesimale optimalität und schätzungen von kovarianzmatrizen. PhD thesis, ETH Zurich, 1981.

[8] D. Donoho. Breakdown properties of multivariate location estimators. Ph.D. qualifying paper, Dept. of Statistics, Harvard Univ.

[9] F.R. Hampel. Robust estimation: a condensed partial survey. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 27:87--104, 1973.

[10] R.A. Maronna and V.J. Yohai. The behavior of the Stahel-Donoho robust multivariate estimator. Journal of the American Statistical Association, 90(429):330--341, 1995.

[11] Y. Zuo, H. Cui, and X. He. On the Stahel-Donoho estimator and depth-weighted means of multivariate data. Ann. Statist., 32(1):167--188, 2004.

April 28:  Haizhao Yang (Purdue University)

Discretization-Invariant Operator Learning: Algorithms and Theory (Watch the Video)

Learning operators between infinite-dimensional spaces is an important learning task arising in wide applications in machine learning, data science, mathematical modeling and simulations, etc. This talk introduces a new discretization-invariant operator learning approach based on data-driven kernels for sparsity via deep learning. Compared to existing methods, our approach achieves attractive accuracy in solving forward and inverse problems, prediction problems, and signal processing problems with zero-shot generalization, i.e., networks trained with a fixed data structure can be applied to heterogeneous data structures without expensive re-training. Under mild conditions, a quantitative generalization error bound will be provided to understand discretization-invariant operator learning in the sense of non-parametric estimation.

April 21:  Bin Dong (Peking University)

Learning to Solve Parametric PDEs (Watch the Video)

Deep learning continues to dominate machine learning and has been successful in computer vision, natural language processing, etc. Its impact has now expanded to many research areas in science and engineering. In this talk, I will present our recent work on combining wisdom from numerical PDEs (e.g., multigrid method) and machine learning to design data-driven solvers for parametric PDEs and their applications in electromagnetic simulation.

April 14:  Ozgur Yilmaz (The University of British Columbia)

Inverting generative models and applications in inverse problems (Watch the Video)

Obtaining accurate signal models is a fundamental problem when solving inverse problems. This is because the typical inverse problem, be it denoising, inpainting, compressed sensing, phase retrieval, medical or seismic imaging, is ill-posed. To find an accurate solution tractably, we need a "good" model of the underlying signal class that can be "imposed" on the solution. Well known, now classical, examples of such models include sparsity with respect to a known basis (compressed sensing), having small total variation (image denoising), and having low rank (matrix and tensor completion). In the Bayesian framework, which provides an alternative approach, signals in a particular class are modeled as realizations of a random vector. The objective is then to model the (prior) distribution of this random vector and use that to identify the posterior distribution, which leads to, for example, the MAP estimator. Recent advances in training deep neural networks have shown that various signal classes can be modeled using generative models, leading to generator networks that map a low-dimensional latent space to the high-dimensional signal space and provide a straightforward way of sampling the signal space according to the prior distribution. Such models were successfully incorporated into the Bayesian framework in the literature to empirically solve certain inverse problems involving, e.g., handwritten digits and faces.

In this talk, we present an algorithm, which we call PLUGIn (Partially Linearized Update for Generative Inversion), for recovering an unknown latent (code) vector from (noisy) observations of a signal under a known generative model, with provable guarantees. Furthermore, while requiring overall expansivity, our convergence guarantees hold for networks with some contractive layers. We will also show that PLUGIn is versatile and can be adapted to various inverse problems such as compressed sensing and reconstruction from quantized measurements.

This is joint work with Babhru Joshi, Xiaowei Li, and Yaniv Plan.
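As a toy illustration of the generative-inversion setup (not the PLUGIn update itself, which partially linearizes the step), the sketch below inverts a small fixed random ReLU generator from compressed measurements by plain gradient descent in latent space. All dimensions, weights, and step sizes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy generator G(z) = relu(W2 @ relu(W1 @ z)):
# latent dim k=5, hidden dim h=20, signal dim n=100, m=40 measurements.
k, h, n, m = 5, 20, 100, 40
W1 = rng.standard_normal((h, k)) / np.sqrt(k)
W2 = rng.standard_normal((n, h)) / np.sqrt(h)
A = rng.standard_normal((m, n)) / np.sqrt(m)

def G(z):
    return np.maximum(W2 @ np.maximum(W1 @ z, 0.0), 0.0)

y = A @ G(rng.standard_normal(k))          # noiseless measurements of G(z*)

def grad(z):
    # Gradient of f(z) = 0.5 * ||A G(z) - y||^2 by hand-rolled backprop.
    a1 = W1 @ z
    a2 = W2 @ np.maximum(a1, 0.0)
    res = A @ np.maximum(a2, 0.0) - y
    g2 = (A.T @ res) * (a2 > 0)
    g1 = (W2.T @ g2) * (a1 > 0)
    return W1.T @ g1

z = 0.01 * rng.standard_normal(k)
loss_start = 0.5 * np.sum((A @ G(z) - y) ** 2)
for _ in range(3000):
    z -= 1e-3 * grad(z)
loss_end = 0.5 * np.sum((A @ G(z) - y) ** 2)
```

Plain gradient descent serves only as a baseline here; the talk's point is that a partially linearized update admits provable guarantees even with some contractive layers.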

April 7:  Mihai Cucuringu (University of Oxford)

Spectral methods for clustering signed and directed networks (Watch the Video)

We consider the problem of clustering in two important families of networks: signed and directed, both relatively less well explored than their unsigned and undirected counterparts. Both problems share an important common feature: they can be solved by exploiting the spectrum of certain graph Laplacian matrices or derivations thereof. In signed networks, the edge weights between the nodes may take either positive or negative values, encoding a measure of similarity or dissimilarity. We consider a generalized eigenvalue problem involving graph Laplacians, with performance guarantees under the setting of a signed stochastic block model, along with regularized versions to handle very sparse graphs. The second problem concerns directed graphs. Imagine a (social) network in which you spot two subsets of accounts, X and Y, for which the overwhelming majority of messages (or friend requests, endorsements, etc.) flow from X to Y, and very few flow from Y to X; would you get suspicious? To this end, we also discuss a spectral clustering algorithm for directed graphs based on a complex-valued representation of the adjacency matrix, which is able to capture the underlying cluster structures for which the information encoded in the direction of the edges is crucial. We evaluate the proposed algorithm in terms of a cut-flow-imbalance-based objective function, which, for a pair of given clusters, captures the propensity of the edges to flow in a given direction. Experiments on a directed stochastic block model and real-world networks showcase the robustness and accuracy of the method when compared to other state-of-the-art methods. Time permitting, we briefly discuss connections to ranking from pairwise comparison data and group synchronization.
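The complex-valued trick can be seen in a few lines. Below is a minimal, hypothetical illustration using the Hermitian matrix i(A - Aᵀ) on a noise-free two-block toy digraph (the talk's methods, like magnetic-Laplacian approaches, refine this idea); the phases of the leading eigenvector separate the "sender" and "receiver" clusters.

```python
import numpy as np

# Toy directed graph: every edge flows from cluster X = {0..9} to Y = {10..19}.
n = 20
A = np.zeros((n, n))
A[:10, 10:] = 1.0

# One simple complex-valued representation: H = i(A - A^T) is Hermitian,
# so it has real eigenvalues and orthogonal eigenvectors.
H = 1j * (A - A.T)

# Leading eigenvector (largest |eigenvalue|).
evals, evecs = np.linalg.eigh(H)
v = evecs[:, np.argmax(np.abs(evals))]

# Entries on X and Y are constant within each block and sit at phases
# 90 degrees apart, so clustering the angles recovers X and Y.
angles = np.angle(v)
```

The direction of flow is invisible to the symmetrized adjacency A + Aᵀ, but survives in the complex phases, which is the point of the representation.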


March 24:  Alexandre d'Aspremont (CNRS and École Normale Supérieure)

Approximation Bounds for Sparse Programs (Watch the Video)

We show that sparsity-constrained optimization problems over low dimensional spaces tend to have a small duality gap. We use the Shapley-Folkman theorem to derive both data-driven bounds on the duality gap, and an efficient primalization procedure to recover feasible points satisfying these bounds. These error bounds are proportional to the rate of growth of the objective with the target cardinality k, which means in particular that the relaxation is nearly tight as soon as k is large enough so that only uninformative features are added.

March 17:  Radu Balan (University of Maryland)

Low-Dimensional Lipschitz Embeddings Invariant to Permutations (Watch the Video)

In this talk we present a Lipschitz embedding of the quotient space generated by matrices that are identified modulo arbitrary row permutations. The original application is in deep learning on graphs, where the learning task is invariant to node relabeling. Two embedding schemes are discussed, one based on sorting and the other based on algebras of multivariate polynomials (tensors). While both embeddings exhibit a computational complexity exponential in problem size, the sorting-based embedding is globally bi-Lipschitz and admits a low-dimensional target space. Additionally, an almost everywhere injective scheme can be implemented with minimal redundancy and low computational cost. In turn, this proves that almost any classifier can be implemented with an arbitrarily small loss of performance. Numerical experiments are carried out on QM9, a chemical compound data set, and PROTEINS_FULL, a protein data set.
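A minimal sketch of the sorting idea in the simplest (vector) case, where vectors are identified modulo permutations of their entries: sorting is invariant under relabeling and, by the rearrangement inequality, the embedded distance equals the quotient distance exactly. The matrix/row-permutation version in the talk generalizes this.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(1)

def embed(x):
    # Sorting is permutation invariant, so it descends to the quotient space.
    return np.sort(x)

x = rng.standard_normal(6)
y = rng.standard_normal(6)

# Invariance: relabeling the entries does not change the embedding.
assert np.allclose(embed(x), embed(x[rng.permutation(6)]))

# Bi-Lipschitz (here even isometric): the embedded distance equals the
# quotient distance min_P ||x - P y||_2, by the rearrangement inequality.
quotient = min(np.linalg.norm(x - y[list(p)]) for p in permutations(range(6)))
embedded = np.linalg.norm(embed(x) - embed(y))
```

Brute-forcing all 6! permutations is only feasible for tiny examples, which is exactly why a cheap invariant embedding is valuable.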

March 10:  Markus Faulhuber (University of Vienna)

Optimal sampling patterns in the time-frequency plane (Watch the Video)

We present a solution to a long-standing problem on optimal sampling strategies in the time-frequency plane. The setup is the following: using time-frequency shifted Gaussian functions, we know how to stably (re)construct any function from the Hilbert space of square integrable functions on the line. However, the stability of the reconstruction depends on the underlying sampling pattern. The conjecture of Strohmer and Beaver stated that the hexagonal lattice should yield the minimal condition number of the associated frame operator. Recently, this conjecture has been solved affirmatively. This is joint work with Laurent Betermin (University of Lyon) and Stefan Steinerberger (University of Washington, Seattle).

March 3:  Boris Hanin (Princeton University)

Random Fully Connected Neural Networks as Perturbatively Solvable Models (Watch the Video)

Fully connected networks are roughly described by two structural parameters: a depth L and a width n. It is well known that, with some important caveats on the scale at initialization, in the regime of fixed L and the limit of infinite n, neural networks at the start of training are a free (i.e. Gaussian) field and that network optimization is kernel regression for the so-called neural tangent kernel (NTK). This is a striking and insightful simplification of infinitely overparameterized networks. However, in this particular infinite width limit neural networks cannot learn data-dependent features, which is perhaps their most important empirical feature. To understand feature learning one must therefore study networks at finite width. In this talk I will do just that. I will report on recent work joint with Dan Roberts and Sho Yaida (done at a physics level of rigor) and some more mathematical ongoing work which allows one to compute, perturbatively in 1/n and recursively in L, all correlation functions of the neural network function (and its derivatives) at initialization. An important upshot is the emergence of L/n, instead of simply L, as the effective network depth. This cut-off parameter provably measures the extent of feature learning and the distance at initialization to the large n free theory.

February 24:  Martin Lotz (University of Warwick)

Concentration of Measure in Geometric Probability and Applications (Watch the Video)

Old problems in geometric probability have received renewed attention due to applications in optimization and statistics. Some of these applications are centred around the notion of intrinsic volumes, fundamental geometric invariants that include the Euler characteristic and the volume. Important results in integral geometry relate the intrinsic volumes of random projections, intersections, and sums of convex bodies to those of the individual bodies. We present new interpretations of classic results, based on the observation that intrinsic volumes (both in spherical and Euclidean settings) concentrate around certain indices. I will give an overview of some old and new developments in this direction, and discuss further applications, for example to the convergence analysis of randomized algorithms. This is joint work with Joel Tropp.

February 17:  Caroline Moosmüller (University of California, San Diego)

Efficient distribution classification via optimal transport embeddings (Watch the Video)

Detecting differences and building classifiers between distributions, given only finite samples, are important tasks in a number of scientific fields. Optimal transport (OT) has evolved as the most natural concept to measure the distance between distributions, and has gained significant importance in machine learning in recent years.  There are some drawbacks to OT: Computing OT can be slow, and it often fails to exploit reduced complexity in case the family of distributions is generated by simple group actions.  In this talk, we discuss how optimal transport embeddings can be used to deal with these issues, both on a theoretical and a computational level.  In particular, we’ll show how to embed the space of distributions into an L^2-space via OT, and how linear techniques can be used to classify families of distributions generated by simple group actions in any dimension. The proposed framework significantly reduces both the computational effort and the required training data in supervised settings. We demonstrate the benefits in pattern recognition tasks in imaging and provide some medical applications.

This talk is based on joint work with Alex Cloninger, Harish Kannan, Varun Khurana, and Jinjie Zhang.
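In one dimension the transport embedding mentioned above is fully explicit, which makes for a compact illustration (sample sizes and the translation action here are illustrative): a distribution is represented by its quantile function, and the W2 distance becomes a plain L2 distance between embeddings, so linear methods apply directly.

```python
import numpy as np

rng = np.random.default_rng(0)

n_q = 200  # number of quantiles used for the embedding

def embed(samples):
    # 1D OT embedding: represent a distribution by its (empirical)
    # quantile function, evaluated on a fixed grid.
    q = np.linspace(0.5 / n_q, 1 - 0.5 / n_q, n_q)
    return np.quantile(samples, q)

base = rng.standard_normal(5000)
shifted = base + 3.0                      # translation, a simple group action

# W2 between a distribution and its translate equals the shift size,
# and in embedding space it is just a Euclidean distance.
w2 = np.linalg.norm(embed(base) - embed(shifted)) / np.sqrt(n_q)
```

Because translations act linearly on the embedded quantile functions, a family of translated distributions becomes a line in L2, which is what makes linear classifiers effective after the embedding.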

February 10:  Johannes Maly (Catholic University of Eichstaett-Ingolstadt)

Why is gradient descent so biased? (Watch the Video)

In deep learning it is common to overparameterize neural networks, that is, to use more parameters than training samples. Quite surprisingly, training these networks via (stochastic) gradient descent leads to models that generalize very well, while classical statistics would suggest overfitting. In this talk, we theoretically analyze the behaviour of vanilla gradient flow/descent in two simplified settings: (i) matrix factorization (which can be seen as training linear neural networks without bias) and (ii) sparse recovery. Whereas in (i) the iterates follow a path of low (effective) rank, in (ii) the limit (approximately) minimizes the $\ell_1$-norm among all possible solutions (under very mild assumptions on the measurement matrix).

This talk is based on joint work with Hung-Hsu Chou, Carsten Gieshoff, and Holger Rauhut.
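A toy experiment in the spirit of setting (ii) above (an assumed setup, not the paper's exact one): overparameterize the unknown as w = u⊙u − v⊙v and run plain gradient descent from a small initialization on an underdetermined least-squares problem. The implicit bias drives the interpolating solution toward small ℓ1 norm.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 20, 50                                  # underdetermined system
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[3, 17, 31]] = [1.0, -2.0, 1.5]         # sparse ground truth
y = A @ x_true

alpha = 1e-3                                   # small init strengthens the bias
u = np.full(n, alpha)
v = np.full(n, alpha)
lr = 0.01
for _ in range(20000):
    g = A.T @ (A @ (u * u - v * v) - y)        # gradient w.r.t. w = u*u - v*v
    u, v = u - lr * 2 * u * g, v + lr * 2 * v * g

w = u * u - v * v
residual = np.linalg.norm(A @ w - y)
l1 = np.linalg.norm(w, 1)
```

No explicit ℓ1 penalty appears anywhere: the sparsity-promoting effect comes entirely from the parameterization and the small initialization scale.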

February 3:  Ankur Moitra (MIT)

Algorithmic Foundations for the Diffraction Limit (Watch the Video)

For more than a century and a half it has been widely believed that the physics of diffraction imposes certain fundamental limits on the resolution of an optical system. However, our understanding of what exactly can and cannot be resolved has never risen above heuristic arguments which, even worse, appear contradictory.

In this work we remedy this gap by studying the diffraction limit as a statistical inverse problem and, based on connections to provable algorithms for learning mixture models, we rigorously prove upper and lower bounds on how many photons we need (and how precisely we need to record their locations) to resolve closely-spaced point sources. Moreover we show the emergence of a phase transition, which helps explain why the diffraction limit can be broken in some domains but not in others.

This is based on joint work with Sitan Chen.

January 27:  Yong Sheng Soh (National University of Singapore)

Learning Sparse Representations with Symmetries (Watch the Video)

In this talk, we consider the problem of learning sparse representations for data under the constraint that the representation satisfies some desired invariance.  The problem of learning sparse representations is more typically referred to as dictionary learning in the literature, and the specific task we consider can be viewed as an extension of the convolutional dictionary learning problem. 

Building on ideas from group representation theory, harmonic analysis, and convex geometry, we describe an end-to-end recipe for learning such data representations that are invariant to a fairly broad family of symmetries, and in particular, continuous ones.  Our techniques draw connections between our learning problem and the geometric problem of fitting appropriately parameterized orbitopes to data.  Spectrahedral descriptions of certain orbitopes based on Toeplitz positive semidefinite matrices feature prominently in our work.

January 20, 2022:  Ilse Ipsen (North Carolina State University)

BayesCG: A probabilistic numeric linear solver (Watch the Video)

We present the probabilistic numeric solver BayesCG for solving linear systems with real symmetric positive definite coefficient matrices. BayesCG is an uncertainty-aware extension of the conjugate gradient (CG) method that performs solution-based inference with Gaussian distributions to capture the uncertainty in the solution due to early termination. Under a structure-exploiting 'Krylov' prior, BayesCG produces the same iterates as CG. The Krylov posterior covariances have low rank and are maintained in factored form to preserve symmetry and positive semi-definiteness. This allows efficient generation of accurate samples to probe uncertainty in subsequent computations.
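For reference, the deterministic CG core is only a few lines (a standard textbook implementation, not the BayesCG code itself); under the Krylov prior the BayesCG posterior means coincide with these iterates, with the posterior covariance quantifying what early termination leaves unresolved.

```python
import numpy as np

rng = np.random.default_rng(0)

def cg(A, b, iters):
    # Textbook conjugate gradients for symmetric positive definite A.
    x = np.zeros_like(b)
    r = b - A @ x                      # residual
    p = r.copy()                       # search direction
    for _ in range(iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)     # exact line search along p
        x += alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p           # A-conjugate update of the direction
        r = r_new
    return x

M = rng.standard_normal((30, 30))
A = M @ M.T + 30 * np.eye(30)          # symmetric positive definite test matrix
b = rng.standard_normal(30)
x = cg(A, b, 30)
```

In exact arithmetic CG terminates in at most n iterations; the probabilistic layer matters precisely when one stops much earlier than that.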

December 16:  Emily King (Colorado State University)

A Potpourri of Mathematical Analyses of Neural Networks (Watch the Video)

Artificial neural networks have proven themselves to be useful in a wide range of applications but operate more-or-less as black boxes.  It is possible to analyze neural networks based on the functional properties of the individual layers or the network as a whole or based on the mathematical properties of actual data as it travels through the network. In this talk, after a gentle introduction to neural networks, examples of these three approaches to mathematically analyze neural networks will be presented. The methods will draw on ideas from linear algebra, random geometry, and approximation theory.

December 9: Philipp Petersen (University of Vienna) 

Optimal learning of classifier functions (Watch the Video)

Deep learning has established itself as, by far, the most successful machine learning approach in sufficiently complex tasks. Nowadays, it is used in a wide range of highly complex applications such as natural language processing or even scientific applications. Its first major breakthrough, however, was achieved by shattering the state-of-the-art in image classification. We revisit the problem of classification by deep neural networks and attempt to find an answer to why deep networks are remarkably effective in this regime. We will interpret the learning of classifiers as finding piecewise constant functions from labelled samples. We then precisely link the hardness of the learning problem to the complexity of the regions. Concretely, we will establish fundamental lower bounds on the learnability of certain regions. Finally, we will show that in many cases, these optimal bounds can be achieved by deep-neural-network-based learning. 

In quite realistic settings, we will observe that deep neural networks can learn high-dimensional classifiers without a strong dependence of the learning rates on the dimension. 

December 2: John Harlim (The Pennsylvania State University) 

Diffusion maps on manifolds with boundaries for solving PDEs (Watch the Video)

I will discuss recent efforts in using the Diffusion Maps (DM) algorithm to solve elliptic PDEs on unknown manifolds using point cloud data. The key idea rests on the fact that, away from the boundary, second-order elliptic differential operators can be approximated by integral operators defined with appropriate Gaussian kernels. On manifolds with boundary, however, such an approximation is only valid for functions that satisfy the Neumann boundary condition. Motivated by the classical ghost-point correction in the finite-difference method for solving Neumann problems, we extend the diffusion maps algorithm with ghost points such that it is a consistent estimator in the pointwise sense even near the boundary. Applying the proposed algorithm, which we call the Ghost Point Diffusion Maps (GPDM), to solve well-posed elliptic PDEs with Dirichlet, Neumann, or Robin boundary conditions, we establish the convergence of the approximate solution under appropriate smoothness assumptions. I will also discuss a neural-network regression solution of an algebraic equation induced by the DM/GPDM discretization. In addition to theoretical convergence for a wide enough neural network model, we numerically found that the proposed solver avoids the classical large-matrix-inversion problem. I will also discuss the application of DM/GPDM to Bayesian elliptic inverse problems. If time permits, I will also present symmetric graph-Laplacian matrices that spectrally converge to the Laplace-Beltrami operator defined on compact manifolds with homogeneous Neumann and Dirichlet boundary conditions.
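The core approximation in the boundaryless case is easy to check numerically. The toy below (parameter choices are illustrative, and the normalization is one simple variant rather than the talk's exact estimator) builds a kernel-based Laplacian from a point cloud on the unit circle and verifies that cos(θ) behaves like an eigenfunction, as it should for the Laplace-Beltrami operator.

```python
import numpy as np

# Point cloud sampled (uniformly) from the unit circle embedded in R^2.
n, eps = 400, 0.05
theta = 2 * np.pi * np.arange(n) / n
pts = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Gaussian kernel on pairwise squared distances, Markov-normalized.
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / eps)
P = K / K.sum(axis=1, keepdims=True)
L = (np.eye(n) - P) / eps          # approximates -Laplacian, up to a scale

# cos(theta) is an eigenfunction of the Laplace-Beltrami operator on the
# circle, so L @ f should be (nearly) parallel to f.
f = np.cos(theta)
g = L @ f
cos_sim = (f @ g) / (np.linalg.norm(f) * np.linalg.norm(g))
```

The talk's ghost-point correction addresses exactly the regime this toy avoids: points near a manifold boundary, where the plain kernel approximation breaks down.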

November 25: Alexandra Carpentier (Universität Potsdam) 

Optimal ranking in crowd sourcing (Watch the Video)

Consider a crowd sourcing problem where we have n experts and d tasks. The average ability of each expert for each task is stored in an unknown matrix M, which is only observed in noise and incompletely. We make no (semi) parametric assumptions, but assume that both experts and tasks can be perfectly ranked: so that if an expert is better than another, she performs on average better on all tasks than the other - and that the same holds for the tasks. This implies that if the matrix M is permuted so that the experts and tasks are perfectly ranked, then the permuted matrix M is bi-isotonic.

We focus on the problem of recovering the optimal ranking of the experts in the l_2 norm, when the questions are perfectly ranked. We provide a minimax-optimal and computationally feasible method for this problem, based on hierarchical clustering, PCA, and exchange of information among the clusters. In particular, we prove - in the case where d > n - that the problem of estimating the expert ranking is significantly easier than the problem of estimating the matrix M.

This talk is based on joint work with Emmanuel Pilliat and Nicolas Verzelen.

November 18: Lorenzo Rosasco (MIT/University of Genova)

Interpolation and learning with scale dependent kernels  (Watch the Video)

We study the learning properties of nonparametric ridgeless least squares. In particular, we consider the common case of estimators defined by scale-dependent (Matern) kernels, and focus on the role of scale and smoothness. These estimators interpolate the data, and the scale can be shown to control their stability to noise and sampling. Larger scales, corresponding to smoother functions, improve stability with respect to sampling. However, smaller scales, corresponding to more complex functions, improve stability to noise. We will discuss to what extent these results can explain the learning curves observed for large overparameterized models. Our analysis combines probabilistic results with analytic techniques from interpolation theory.

November 11: Sjoerd Dirksen (Utrecht University)

The separation capacity of random neural networks (Watch the Video)

Neural networks with random weights appear in a variety of machine learning applications, most prominently as the initialization of many deep learning algorithms and as a computationally cheap alternative to fully learned neural networks. The goal of this talk is to enhance the theoretical understanding of random neural networks by addressing the following data separation problem: under what conditions can a random neural network make two classes (with positive distance) linearly separable? I will show that a sufficiently large two-layer ReLU-network with Gaussian weights and uniformly distributed biases can solve this problem with high probability. The number of required neurons in the two layers is explicitly linked to geometric properties of the two sets and their mutual arrangement. This instance-specific viewpoint allows to overcome the curse of dimensionality (exponential width of the layers). I will connect the presented separation result with related lines of work on approximation, memorization, and generalization.

The talk is based on joint work with Martin Genzel, Laurent Jacques, and Alexander Stollenwerk (arXiv:2108.00207).
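The separation question is easy to probe numerically. The toy below (sizes, geometry, and seeds are all hypothetical) draws a random two-layer ReLU feature map with Gaussian weights and uniformly distributed biases, as in the talk's setting, and certifies linear separability of the resulting features with a perceptron, which converges to zero errors exactly when the features are linearly separable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated classes in the plane.
X = np.vstack([rng.standard_normal((40, 2)) * 0.3 + [2, 2],
               rng.standard_normal((40, 2)) * 0.3 + [-2, -2]])
labels = np.array([1] * 40 + [-1] * 40)

# Random (never trained) two-layer ReLU feature map.
W = rng.standard_normal((50, 2))           # Gaussian weights
b = rng.uniform(-1, 1, 50)                 # uniform biases
F = np.maximum(X @ W.T + b, 0.0)

# Perceptron on the features: reaches zero errors iff linearly separable.
w = np.zeros(50)
bias = 0.0
for _ in range(1000):
    errors = 0
    for f, t in zip(F, labels):
        if t * (f @ w + bias) <= 0:
            w += t * f
            bias += t
            errors += 1
    if errors == 0:
        break
```

The talk's contribution is quantitative: how wide the random layers must be, as a function of the geometry of the two sets, for this separation to succeed with high probability.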

November 4: Weilin Li (New York University)

Super-resolution, subspace methods, and Fourier matrices (Watch the Video)

This talk is concerned with the inverse problem of recovering a discrete measure on the torus given a finite number of its noisy Fourier coefficients. We focus on the diffraction-limited regime, where at least two atoms are closer together than the Rayleigh length. We show that the fundamental limits of this problem and the stability of subspace (algebraic) methods, such as ESPRIT and MUSIC, are closely connected to the minimum singular value of non-harmonic Fourier matrices. We provide novel bounds for the latter in the case where the atoms are located in clumps. We also provide an analogous theory for a statistical model, where the measure is time-dependent and Fourier measurements are collected at various times. Joint work with Wenjing Liao, Albert Fannjiang, Zengying Zhu, and Weiguo Gao.
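The key quantity is simple to compute in a toy setting (all numbers below are illustrative): form the non-harmonic Fourier matrix V with entries exp(2πi·m·x_k) and watch its smallest singular value collapse once two atoms move well below the Rayleigh length ~1/M.

```python
import numpy as np

def min_sv(xs, M):
    # Smallest singular value of the M x len(xs) non-harmonic Fourier matrix
    # V[m, k] = exp(2*pi*i * m * x_k), for atom locations xs on the torus.
    m = np.arange(M)
    V = np.exp(2j * np.pi * np.outer(m, xs))
    return np.linalg.svd(V, compute_uv=False)[-1]

M = 32                                     # Rayleigh length ~ 1/32
s_far = min_sv([0.1, 0.5, 0.9], M)         # all atoms well separated
s_close = min_sv([0.1, 0.1 + 0.02 / M, 0.9], M)   # a sub-Rayleigh clump
```

This collapse is what limits any method, and the talk's bounds quantify it precisely for clumped configurations.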

October 28: Rongrong Wang (Michigan State University)

Sigma Delta quantization on images, manifolds, and graphs (Watch the Video)

In digital signal processing, quantization is the step of converting a signal's real-valued samples into a finite string of bits. As the first step in digital processing, it plays a crucial role in determining the information conversion rate and the reconstruction accuracy. Compared to non-adaptive quantizers, adaptive ones are known to be more efficient at quantizing bandlimited signals, especially when the bit budget is small (e.g., 1 bit) and noise is present.

However, adaptive quantizers are currently only designed for 1D functions/signals. In this talk, I will discuss challenges in extending it to high dimensions and present our proposed solutions. Specifically, we design new adaptive quantization schemes to quantize images/videos as well as functions defined on 2D surface manifolds and general graphs, which are common objects in signal processing and machine learning. Mathematically, we start from the 1D Sigma-Delta quantization, extend them to high-dimensions and build suitable decoders. The discussed theory would be useful in natural image acquisition, medical imaging,  3D printing, and graph embedding.
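The 1D building block is short enough to state in full. Below is a first-order Sigma-Delta quantizer (the classical scheme the talk extends to images, manifolds, and graphs) with a moving-average decoder; since window sums of x − q telescope to a difference of bounded states, the decoding error on any interior window is at most 2/window.

```python
import numpy as np

def sigma_delta(x):
    # First-order Sigma-Delta: 1-bit output with error-feedback state u.
    q = np.empty_like(x)
    u = 0.0
    for i, xi in enumerate(x):
        q[i] = 1.0 if u + xi >= 0 else -1.0   # 1-bit quantizer
        u += xi - q[i]                        # state = accumulated error
    return q, u

t = np.linspace(0.0, 1.0, 2000)
x = 0.7 * np.sin(2 * np.pi * 3 * t)           # smooth input with |x| < 1
q, u_final = sigma_delta(x)

# Decode with a simple moving average; error decays like 1/window.
window = 50
kernel = np.ones(window) / window
recon = np.convolve(q, kernel, mode="same")
smooth = np.convolve(x, kernel, mode="same")
err = np.max(np.abs((recon - smooth)[window:-window]))
```

The feedback state u is what has no obvious analogue on a 2D grid or a graph, which is precisely the difficulty the talk addresses.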

October 14: Edgar Dobriban (University of Pennsylvania)

Asymptotic perspectives on sketching (Watch the Video) 

Sketching and random projection methods are a powerful set of techniques to speed up computations in numerical linear algebra, statistics, machine learning, optimization and data science. In this talk, we will discuss some of our works on developing a "big data" asymptotic perspective on sketching in the fundamental problems of linear regression and principal component analysis. This can lead to remarkably clean and elegant mathematical results, which yield powerful insights into the performance of various sketching methods. To highlight one, orthogonal sketches such as the Subsampled Randomized Hadamard Transform are provably better than iid sketches such as Gaussian sketching. This is obtained by using deep recent tools from asymptotic random matrix theory and free probability, including asymptotically liberating sequences (Anderson & Farrell, 2014). This is based on joint works with Jonathan Lacotte, Sifan Liu, Mert Pilanci, David P. Woodruff, and Fan Yang.
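A minimal sketch-and-solve example for least squares (dimensions are illustrative; a Gaussian sketch stands in for the iid case, against which the talk compares orthogonal sketches like the SRHT): compress the tall problem with a random matrix and solve the small problem instead.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 2000, 10, 200                    # tall problem, sketch size m
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star + 0.5 * rng.standard_normal(n)

x_opt = np.linalg.lstsq(A, b, rcond=None)[0]            # full solve

S = rng.standard_normal((m, n)) / np.sqrt(m)            # iid Gaussian sketch
x_sk = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]     # sketched solve

r_opt = np.linalg.norm(A @ x_opt - b)                   # optimal residual
r_sk = np.linalg.norm(A @ x_sk - b)                     # near-optimal residual
```

The asymptotic theory in the talk makes the residual inflation factor precise as n, d, m grow proportionally, and shows it is strictly smaller for orthogonal sketches than for the iid sketch used here.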

October 7: Afonso Bandeira (ETHZ - Swiss Federal Institute of Technology Zürich)

Noncommutative Matrix Concentration Inequalities (Watch the Video)

Matrix Concentration inequalities such as Matrix Bernstein inequality have played an important role in many areas of pure and applied mathematics. These inequalities are intimately related to the celebrated noncommutative Khintchine inequality of Lust-Piquard and Pisier. In the middle of the 2010's, Tropp improved the dimensional dependence of this inequality in certain settings by leveraging cancellations due to non-commutativity of the underlying random matrices, giving rise to the question of whether such dependency could be removed.

In this talk we leverage ideas from Free Probability to fully remove the dimensional dependence in a range of instances, yielding optimal bounds in many settings of interest. As a byproduct we develop matrix concentration inequalities that capture non-commutativity (or, to be more precise, ``freeness''), improving over Matrix Bernstein in a range of instances. No background knowledge of Free Probability will be assumed in the talk.

Joint work with March Boedihardjo and Ramon van Handel, more information at arXiv:2108.06312 [math.PR].

September 30: Michael Perlmutter (UCLA)

Neural Networks on (Directed) Graphs (Watch the Video)

The prevalence of graph-based data has spurred the rapid development of graph neural networks (GNNs) and related machine learning algorithms. These methods extend convolutions to graphs either in the spatial domain, as a localized averaging operator, or in the spectral domain, via the eigendecomposition of a suitable Laplacian. However, most popular GNNs have two limitations. i) The filters used in these networks are essentially low-pass filters (i.e., averaging operators). This leads to the so-called ``oversmoothing problem'' and the loss of high-frequency information. ii) If the graph is directed, as is the case in many applications including citation, website, and traffic networks, these networks are unable to effectively encode directional information. In this talk, I discuss how we can overcome these limitations via i) the graph scattering transform, which uses band-pass filters rather than low-pass ones, and ii) MagNet, a network designed for directed graphs based on a complex Hermitian matrix known as the magnetic Laplacian.

September 23: Joel Tropp (California Institute of Technology)

Scalable semidefinite programming (Watch the Video)

Semidefinite programming (SDP) is a powerful framework from convex optimization that has striking potential for data science applications. This talk describes a provably correct randomized algorithm for solving large, weakly constrained SDP problems by economizing on the storage and arithmetic costs. Numerical evidence shows that the method is effective for a range of applications, including relaxations of MaxCut, abstract phase retrieval, and quadratic assignment problems. Running on a laptop equivalent, the algorithm can handle SDP instances where the matrix variable has over 10^14 entries.

This talk will highlight the ideas behind the algorithm in a streamlined setting. The insights include a careful problem formulation, design of a bespoke optimization method, and use of randomized matrix computations.

Joint work with Alp Yurtsever, Olivier Fercoq, Madeleine Udell, and Volkan Cevher. Based on arXiv 1912.02949 (Scalable SDP, SIMODS 2021) and other papers (SketchyCGM in AISTATS 2017, Nyström sketch in NeurIPS 2017).

September 16: Russell Luke (University of Göttingen)

Structured (In)Feasibility:  Nonmonotone Operator Splitting in Nonlinear Spaces (Watch the Video)

The success of operator splitting techniques for convex optimization has led to an explosion of methods for solving large-scale and nonconvex optimization problems via convex relaxation.  This success is at the cost of overlooking direct approaches to operator splitting that embrace some of the more inconvenient aspects of many model problems, namely nonconvexity, nonsmoothness and infeasibility.  I will introduce some of the tools we have developed for handling these issues, and present sketches of the basic results we can obtain.  The formalism is in general metric spaces, but most applications have their basis in Euclidean spaces.  Along the way I will try to point out connections to other areas of intense interest, such as optimal mass transport.

September 9: Anna Ma (University of California, Irvine)

The Kaczmarz Algorithm: Greed, Randomness, and Tensors

The Kaczmarz algorithm is an iterative method for solving linear systems of equations of the form Ax=y. Owing to its low memory footprint, the Kaczmarz algorithm has gained popularity for its practicality in applications to large-scale data, acting only on single rows of A at a time. In this talk, we discuss selecting rows of A randomly (Randomized Kaczmarz), selecting rows in a greedy fashion (Motzkin's Method), and selecting rows in a partially greedy fashion (Sampling Kaczmarz-Motzkin algorithm). Despite their variable computational costs, these algorithms have been proven to have the same theoretical upper bound on the convergence rate. Here we present an improvement upon previous known convergence bounds of the Sampling Kaczmarz-Motzkin algorithm, capturing the benefit of partially greedy selection schemes. Time permitting, we also will discuss an extension of the Kaczmarz algorithm to the setting where data takes on the form of a tensor and make connections between the new Tensor Kaczmarz algorithm and previously established algorithms. This presentation contains joint work with Jamie Haddock and Denali Molitor.
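The randomized variant is only a few lines (a standard toy implementation on an assumed consistent Gaussian system, not the talk's code): each step projects the iterate onto the hyperplane defined by one row, sampled with probability proportional to its squared norm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Consistent overdetermined system A x = y.
m, n = 100, 20
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = A @ x_true

row_norms2 = (A * A).sum(axis=1)
probs = row_norms2 / row_norms2.sum()   # sample rows prop. to squared norm

x = np.zeros(n)
for _ in range(3000):
    i = rng.choice(m, p=probs)
    # Project x onto the hyperplane {z : A[i] @ z = y[i]}.
    x += (y[i] - A[i] @ x) / row_norms2[i] * A[i]
```

Motzkin's method replaces the random row choice with the row of largest residual, and the sampling Kaczmarz-Motzkin scheme interpolates between the two, which is where the talk's sharper convergence bounds come in.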

September 2: Jonathan Scarlett (National University of Singapore)

Beyond Sparsity: Compressive Sensing with (Deep) Generative Modeling Assumptions (Watch the Video)

The problem of estimating an unknown vector from linear measurements has a long history in statistics, machine learning, and signal processing. Classical studies focus on the "n >> p" regime (#measurements >> #parameters), and more recent studies handle the "n << p" regime by exploiting low-dimensional structure such as sparsity or low-rankness. Such variants are commonly known as compressive sensing.

In this talk, I will overview recent methods that move beyond these simple notions of structure, and instead assume that the underlying vector is well-modeled by a generative model (e.g., produced by deep learning methods such as GANs). I will highlight algorithmic works that demonstrated up to 5-10x savings in the number of measurements over sparsity-based methods, and then move on to our theoretical work characterizing the order-optimal sample complexity in terms of quantities such as (i) the Lipschitz constant of the model, or (ii) the depth/width in a neural network model. I will also briefly highlight some recent results on non-linear observation models.

August 26: Deanna Needell (UCLA)

On the topic of topic modeling: enhancing machine learning approaches with topic features (Watch the Video)

In this talk we touch on several problems in machine learning that can benefit from the use of topic modeling. We present topic modeling based approaches for online prediction problems, computer vision, text generation, and others. While these problems have classical machine learning approaches that work well, we show that by incorporating contextual information via topic features, we obtain enhanced and more realistic results. These classical methods include non-negative matrix and tensor factorization, generative adversarial networks, and even traditional epidemiological SIR models for prediction. In this talk we provide a brief overview of these problems and show how topic features can be used in these settings. We include supporting theoretical and experimental evidence that showcases the broad use of our approaches.

August 19: Ladislav Rampášek (Université de Montréal and Mila)

Hierarchical Graph Nets and Long-Range Interactions (Watch the Video)

Graph neural networks (GNNs) based on message passing between neighboring nodes are known to be insufficient for capturing long-range interactions in graphs. Here we study hierarchical message passing models that leverage a multi-resolution representation of a given graph. This facilitates learning of features that span large receptive fields without loss of local information, an aspect not studied in preceding work on hierarchical GNNs. We introduce Hierarchical Graph Net (HGNet), which for any two connected nodes guarantees the existence of message-passing paths of at most logarithmic length w.r.t. the input graph size. Yet, under mild assumptions, its internal hierarchy maintains asymptotic size equivalent to that of the input graph. We observe that our HGNet outperforms conventional stacking of GCN layers, particularly on molecular property prediction benchmarks. Finally, we propose two benchmarking tasks designed to elucidate the capability of GNNs to leverage long-range interactions in graphs.

August 12: Andrea Bertozzi (UCLA)

The challenges of modeling pandemic spread with early time data, finite size populations, and opinion dynamics

The coronavirus disease 2019 (COVID-19) pandemic placed epidemic modeling at the forefront of worldwide public policy making. Nonetheless, modeling and forecasting the spread of COVID-19 remain a challenge. This talk begins with a review of the historical use of epidemic models and addresses the challenges of choosing a model in the early stages of a worldwide pandemic. The spread of COVID-19 has illustrated the heterogeneity of disease spread at different population levels. With finite size populations, chance variations may lead to significant departures from the mean. In real-life applications, finite size effects arise from the variance of individual realizations of an epidemic course about its fluid limit. I will illustrate how to model this variance with a martingale formulation consisting of a deterministic and a stochastic component, along with estimates for the size of the variance compared to real world data and simulations. Another cause of heterogeneity is the differing attitudes at the population level toward control measures such as mask-wearing and physical distancing. Individuals often form opinions about these behaviors through their social networks. I will show some results from a two-layer multiplex network for the coupled spread of a disease and conflicting opinions. We model each process as a contagion. We develop approximations of mean-field type by generalizing monolayer pair approximations to multilayer networks; these approximations agree well with Monte Carlo simulations for a broad range of parameters and several network structures. We find that lengthening the duration that individuals hold an opinion may help suppress disease transmission, and we demonstrate that increasing the cross-layer or intra-layer correlations of node degrees may lead to fewer individuals becoming infected with the disease.
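The fluid limit mentioned above is, in the simplest case, the classical deterministic SIR system S' = -beta*S*I/N, I' = beta*S*I/N - gamma*I, R' = gamma*I; finite-size stochastic realizations fluctuate around this curve. A forward-Euler sketch (illustrative parameter values, not from the talk):

```python
import numpy as np

# Deterministic SIR fluid limit; beta/gamma = R0 = 3 here (illustrative).
N, beta, gamma = 10_000.0, 0.3, 0.1
S, I, R = N - 10.0, 10.0, 0.0
dt, T = 0.1, 300.0
peak_I = I
for _ in range(int(T / dt)):
    new_inf = beta * S * I / N * dt       # S -> I transitions this step
    new_rec = gamma * I * dt              # I -> R transitions this step
    S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
    peak_I = max(peak_I, I)

print(f"peak infected: {peak_I:.0f}, final susceptible: {S:.0f}")
```

For R0 = 3 the peak infected fraction is about 30% of the population and the final susceptible fraction about 6%, which the simulation reproduces; the stochastic models in the talk quantify fluctuations about exactly this kind of trajectory.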

August 5: Alex Wein (NYU)

Optimal Sparse Recovery of a Planted Vector in a Subspace (Watch the Video)

We consider the task of recovering a pN-sparse vector planted in an n-dimensional random subspace of R^N, given an arbitrary basis for the subspace. We give an improved analysis of (a slight variant of) a spectral method proposed by Hopkins, Schramm, Shi, and Steurer (STOC 2016), showing that it succeeds when np << sqrt(N). This condition improves upon the best known polynomial-time guarantees of any algorithm. Our analysis also applies to the dense case p=1, provided that the planted vector has non-Gaussian entries. Furthermore, we give a matching lower bound, showing that when np >> sqrt(N), a general class of spectral methods fail to detect the planted vector. Specifically, we rule out any method based on thresholding the leading eigenvalue of any polynomial-size matrix whose entries are constant-degree polynomials of the input. This yields a tight characterization of the power of spectral methods and suggests that no polynomial-time algorithm can succeed when np >> sqrt(N).

This is joint work with Cheng Mao, available at https://arxiv.org/abs/2105.15081.

July 29: Wotao Yin (Alibaba Damo Academy)

Learning to Optimize: Fixed Point Network (Watch the Video)

Many applications require repeatedly solving a certain optimization problem, each time with new but similar data. “Learning to optimize” (L2O) is an approach to developing algorithms that solve these similar problems very efficiently. L2O-generated algorithms have achieved significant success in signal processing and inverse-problem applications. This talk introduces the motivation for L2O and gives a quick overview of different types of L2O approaches for continuous optimization. Then, we will introduce Fixed Point Networks (FPNs), which incorporate fixed-point iterations into deep neural networks and provide capabilities such as physics-based inversion, data-driven regularization, encoding of hard constraints, and infinite depth. FPNs are easy to train with a new Jacobian-free backpropagation (JFB) scheme.
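The forward pass of such a network iterates a learned update map to a fixed point. A minimal numpy sketch with a hypothetical contractive update (the map, sizes, and contraction factor are assumptions for illustration; JFB would then differentiate only the final iteration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy update z -> T(z; x) = tanh(W z + U x + b).  Scaling ||W||_2 below 1
# makes T a contraction, so the iteration converges to a unique fixed point.
n, d = 5, 3
W = rng.standard_normal((n, n))
W *= 0.5 / np.linalg.norm(W, 2)          # enforce ||W||_2 = 0.5 < 1
U = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = rng.standard_normal(d)

z = np.zeros(n)
for k in range(100):
    z_next = np.tanh(W @ z + U @ x + b)
    if np.linalg.norm(z_next - z) < 1e-10:
        break                            # fixed point reached
    z = z_next

residual = np.linalg.norm(z - np.tanh(W @ z + U @ x + b))
print(f"converged in {k} iterations, fixed-point residual {residual:.2e}")
```

Because tanh is 1-Lipschitz, the composite map is 0.5-contractive and the iteration converges geometrically regardless of depth, which is what makes the "infinite depth" interpretation possible.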

July 22: Yaniv Plan (University of British Columbia)

A family of measurement matrices for generalized compressed sensing (Watch the Video)

We consider the problem of recovering a structured signal x that lies close to a subset of interest T in R^n, from its random noisy linear measurements y = B A x + w, i.e., a generalization of the classic compressed sensing problem. Above, B is a fixed matrix and A has independent sub-gaussian rows. By varying B, and the sub-gaussian distribution of A, this gives a family of measurement matrices which may have heavy tails, dependent rows and columns, and singular values with large dynamic range. Typically, generalized compressed sensing assumes a random measurement matrix with nearly uniform singular values (with high probability), and asks: How many measurements are needed to recover x? In our setting, this question becomes: What properties of B allow good recovery? We show that the “effective rank” of B may be used as a surrogate for the number of measurements, and if this exceeds the squared Gaussian complexity of T-T then accurate recovery is guaranteed. We also derive the optimal dependence on the sub-gaussian norm of the rows of A; to our knowledge, this was not known previously even in the case when B is the identity. We allow model mismatch and inexact optimization with the aim of creating an easily accessible theory that specializes well to the case when T is the range of an expansive neural net.

July 15, 2021: Qi (Rose) Yu (University of California, San Diego)

Equivariant Neural Networks for Learning Spatiotemporal Dynamics (Watch the Video)

Applications such as climate science and transportation require learning complex dynamics from large-scale spatiotemporal data. Existing machine learning frameworks are still insufficient to learn spatiotemporal dynamics as they often fail to exploit the underlying physics principles. Representation theory can be used to describe and exploit the symmetry of the dynamical system. We will show how to design neural networks that are equivariant to various symmetries for learning spatiotemporal dynamics.  Our methods demonstrate significant improvement in prediction accuracy, generalization, and sample efficiency in forecasting turbulent flows and predicting real-world trajectories.  This is joint work with Robin Walters, Rui Wang, and Jinxi Li.

First Year of Talks Below:  April 2020 - June 2021

Founding Organizers (April 2020 - June 2021):  Mark Iwen (Principal Organizer, Michigan State University), Bubacarr Bah (African Institute for Mathematical Sciences South Africa), Afonso Bandeira (ETH-Zurich), Matthew Hirn (Michigan State University), Felix Krahmer (Technische Universität München), Shuyang Ling (New York University Shanghai), Ursula Molter (Universidad de Buenos Aires), Deanna Needell (University of California, Los Angeles), Rayan Saab, (University of California, San Diego), and Rongrong Wang (Michigan State University)

June 24, 2021: Qiang Ye (University of Kentucky)

Batch Normalization and Preconditioning for Neural Network Training (Watch the Video)

Batch normalization (BN) is a popular and ubiquitous method in deep neural network training that has been shown to decrease training time and improve generalization performance. Despite its success, BN is not theoretically well understood. It is not suitable for use with very small mini-batch sizes or online learning. In this talk, we will review BN and present a preconditioning method called Batch Normalization Preconditioning (BNP) to accelerate neural network training. We will analyze the effects of mini-batch statistics of a hidden variable on the Hessian matrix of a loss function and propose a parameter transformation that is equivalent to normalizing the hidden variables to improve the conditioning of the Hessian. Compared with BN, one benefit of BNP is that it is not constrained by the mini-batch size and works in the online learning setting. We will present several experiments demonstrating the competitiveness of BNP. Furthermore, we will discuss a connection to BN which provides theoretical insights on how BN improves training and how BN is applied to special architectures such as convolutional neural networks.

The talk is based on joint work with Susanna Lange and Kyle Helfrich.
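The core mechanism (normalization improving Hessian conditioning) is easy to see in the simplest setting: for least squares the Hessian is X^T X, and standardizing the features repairs an ill-conditioned problem. A toy sketch (illustrative data, not the BNP algorithm itself):

```python
import numpy as np

rng = np.random.default_rng(2)

# The loss 0.5*||X w - y||^2 has Hessian X^T X.  Features with very different
# scales and offsets make this Hessian ill-conditioned; standardizing each
# feature (the effect normalization has on hidden variables) repairs it.
n = 500
X = np.column_stack([
    rng.standard_normal(n) * 100.0 + 50.0,   # large scale, large offset
    rng.standard_normal(n) * 0.01,           # tiny scale
    rng.standard_normal(n),                  # unit scale
])
Xn = (X - X.mean(axis=0)) / X.std(axis=0)    # center and rescale each feature

cond_raw = np.linalg.cond(X.T @ X)
cond_norm = np.linalg.cond(Xn.T @ Xn)
print(f"condition number: raw {cond_raw:.1e}, normalized {cond_norm:.1e}")
```

Gradient descent converges at a rate governed by this condition number, which is why the transformation acts as a preconditioner.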

June 17: Wenjing Liao (Georgia Tech)

Regression and doubly robust off-policy learning on low-dimensional manifolds by neural networks (Watch the Video)

Many data in real-world applications are in a high-dimensional space but exhibit low-dimensional structures. In mathematics, these data can be modeled as random samples on a low-dimensional manifold. Our goal is to estimate a target function or learn an optimal policy using neural networks. This talk is based on an efficient approximation theory of deep ReLU networks for functions supported on a low-dimensional manifold. We further establish the sample complexity for regression and off-policy learning with finite samples of data. When data are sampled on a low-dimensional manifold, the sample complexity crucially depends on the intrinsic dimension of the manifold instead of the ambient dimension of the data. These results demonstrate that deep neural networks are adaptive to low-dimensional geometric structures of data sets. This is a joint work with Minshuo Chen, Haoming Jiang, Liu Hao, Tuo Zhao at Georgia Institute of Technology.

June 10: Piotr Indyk (MIT)

Learning-Based Sampling and Streaming (Watch the Video)

Classical algorithms typically provide "one size fits all" performance, and do not leverage properties or patterns in their inputs. A recent line of work aims to address this issue by developing algorithms that use machine learning predictions to improve their performance. In this talk I will present two examples of this type, in the context of streaming and sampling algorithms. In particular, I will show how to use machine learning predictions to improve the performance of (a) low-memory streaming algorithms for frequency estimation (ICLR’19), and (b) sampling algorithms for estimating the support size of a distribution (ICLR’21).   Both algorithms use an ML-based predictor that, given a data item, estimates the number of times the item occurs in the input data set.  

The talk will cover material from papers co-authored with  T Eden, CY Hsu, D Katabi, S Narayanan, R Rubinfeld, S Silwal, T Wagner and A Vakilian.
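The learning-augmented frequency-estimation idea can be sketched in a few lines: route items a (here hypothetical, perfectly accurate) learned oracle flags as heavy to exact counters, and everything else to a Count-Min sketch. A simplified sketch of that architecture, with made-up stream and sketch sizes:

```python
import random
from collections import Counter

random.seed(0)

# Zipf-like stream: item i appears roughly 2000/i times.
stream = [i for i in range(1, 201) for _ in range(2000 // i)]
random.shuffle(stream)
true_counts = Counter(stream)

# Hypothetical learned oracle: flags the heavy items (here: small ids).
predicted_heavy = set(range(1, 21))

# Count-Min sketch for everything the oracle does not flag.
WIDTH, DEPTH = 50, 4
table = [[0] * WIDTH for _ in range(DEPTH)]

def cms_hash(item, row):
    return hash((row, item)) % WIDTH

exact = Counter()
for item in stream:
    if item in predicted_heavy:
        exact[item] += 1                      # exact counter for predicted heavy hitters
    else:
        for r in range(DEPTH):
            table[r][cms_hash(item, r)] += 1

def estimate(item):
    if item in predicted_heavy:
        return exact[item]
    return min(table[r][cms_hash(item, r)] for r in range(DEPTH))

err_heavy = sum(abs(estimate(i) - true_counts[i]) for i in predicted_heavy)
print("total error on predicted-heavy items:", err_heavy)
```

Heavy items, which dominate Count-Min error in standard sketches, are counted exactly, while light items keep the usual never-underestimating Count-Min guarantee.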

June 3: Gerlind Plonka-Hoch (University of Göttingen)

Recovery of sparse signals from their Fourier coefficients (Watch the Video)

In this talk, we study a new recovery procedure for non-harmonic signals, or more generally for extended exponential sums y(t), which are determined by a finite number of parameters.  For the reconstruction  we employ a finite set of classical Fourier coefficients of y(t). Our new recovery method is based on the observation that the Fourier coefficients of y(t) possess a special rational structure.  We apply the recently proposed AAA algorithm by Nakatsukasa et al. (2018) to recover this rational structure in a stable way. If a sufficiently large set of Fourier coefficients of y(t) is available, then our recovery method automatically detects the correct number of terms M of the exponential sums y(t) and reconstructs all unknown parameters of the signal model. Our method provides a new stable alternative to the known numerical approaches for the recovery of exponential sums that are based on Prony's method.

These results have been obtained jointly with Markus Petz and Nadiia Derevianko.

May 27: Akram Aldroubi (Vanderbilt University)

Dynamical Sampling: A Sampling Tour (Watch the Video)

Dynamical sampling is a term describing an emerging set of problems related to recovering signals and evolution operators from space-time samples. For example, one problem is to recover a function f from space-time samples {(A_{t_i}f)(x_i)} of its time evolution f_t = (A_t f)(x), where {A_t}_{t∈T} is a known evolution operator and {(x_i, t_i)} ⊂ R^d × R^+.

Another example is a system identification problem when both f and the evolution family {A_t}_{t∈T} are unknown.  Applications of dynamical sampling include inverse problems in sensor networks, and source term recovery from physically driven evolutionary systems. Dynamical sampling problems are tightly connected to frame theory as well as more classical areas of mathematics such as approximation theory, and functional analysis.  In this talk, I will describe a few problems in dynamical sampling, some results and open problems.
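A toy discrete instance of the first problem above (illustrative sizes and a random evolution operator, not from the talk): with a known matrix A, observing only 3 of 12 coordinates but at several time steps stacks into an ordinary overdetermined linear system.

```python
import numpy as np

rng = np.random.default_rng(3)

# Known evolution operator A; we see f only at 3 of 12 locations, but at
# times t = 0, 1, ..., 5.  Stacking the sampled rows of A^t turns the
# space-time samples into one linear system for f.
n, idx, T = 12, [0, 5, 9], 6
A = rng.standard_normal((n, n)) / np.sqrt(n)
f = rng.standard_normal(n)

rows, obs = [], []
At = np.eye(n)
for t in range(T):
    rows.append(At[idx, :])                  # rows that sample (A^t f) at idx
    obs.append(At[idx, :] @ f)
    At = A @ At
M = np.vstack(rows)                          # 18 x 12 stacked system
y = np.concatenate(obs)

f_hat, *_ = np.linalg.lstsq(M, y, rcond=None)
print("recovery error:", np.linalg.norm(f_hat - f))
```

Time samples trade off against space samples: 3 spatial sensors observed over 6 time steps behave like 18 linear measurements, enough to recover the 12-dimensional f when the stacked system has full rank.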

May 20: Philipp Grohs (University of Vienna)

A Proof of the Theory-to-Practice Gap in Deep Learning (Watch the Video)

It is now well understood that the approximation properties of neural networks surpass those of virtually all other known approximation methods. On the other hand, the numerical realization of these theoretical approximation properties seems quite challenging. Recent empirical results suggest that there might exist a tradeoff between the superior approximation properties and efficient algorithmic realizability of neural network based methods. We prove the existence of this tradeoff by establishing hardness results for the problems of approximation and integration on neural network approximation spaces. These results in particular show that high approximation rates cannot be algorithmically realized by commonly used algorithms such as stochastic gradient descent and its variants. (joint work with Felix Voigtlaender)

May 13: Simone Brugiapaglia (Concordia University)

The curse of dimensionality and the blessings of sparsity and Monte Carlo sampling: From polynomial to deep neural network approximation in high dimensions (Watch the Video)

Approximating multi-dimensional functions from pointwise samples is a ubiquitous task in data science and scientific computing. This task is made intrinsically difficult by the presence of four contemporary challenges: (i) the target function is usually defined over a high- or infinite-dimensional domain; (ii) generating samples can be very expensive; (iii) samples are corrupted by unknown sources of errors; (iv) the target function might take values in a function space. In this talk, we will show how these challenges can be substantially overcome thanks to the "blessings" of sparsity and Monte Carlo sampling.

First, we will consider the case of sparse polynomial approximation via compressed sensing. Focusing on the case where the target function is smooth (e.g., holomorphic), but possibly highly anisotropic, we will show how to obtain sample complexity bounds only mildly affected by the curse of dimensionality, near-optimal accuracy guarantees, stability to unknown errors corrupting the data, and rigorous convergence rates of algebraic and exponential type.

Then, we will illustrate how the mathematical toolkit of sparse polynomial approximation via compressed sensing can be employed to obtain a practical existence theorem for Deep Neural Network (DNN) approximation of high-dimensional Hilbert-valued functions. This result shows not only the existence of a DNN with desirable approximation properties, but also how to compute it via a suitable training procedure in order to achieve best-in-class performance guarantees.

We will conclude by discussing open research questions.

The presentation is based on joint work with Ben Adcock, Casie Bao, Nick Dexter, Sebastian Moraga, and Clayton G. Webster.

May 6: Michael Kwok-Po Ng (The University of Hong Kong)

Low Rank Tensor Completion with Poisson Observations (Watch the Video)

Poisson observations for videos are important models in video processing and computer vision. In this talk, we study the third-order tensor completion problem with Poisson observations. The main aim is to recover a tensor based on a small number of its Poisson-observed entries. An existing matrix-based method may be applied to this problem via the matricized version of the tensor. However, this method does not leverage the global low-rankness of a tensor and may be substantially suboptimal. We employ a transformed tensor nuclear norm ball constraint and a bounded constraint on each entry, where the transformed tensor nuclear norm is used to obtain a tensor with lower transformed multi-rank under suitable unitary transformation matrices. We show that the upper bound on the error of the estimator of the proposed model is smaller than that of the existing matrix-based method. Numerical experiments on synthetic data and real-world datasets are presented to demonstrate the effectiveness of our proposed model compared with existing tensor completion methods.

April 29: Anne Gelb (Dartmouth College)

Empirical Bayesian Inference using Joint Sparsity (Watch the Video)

We develop a new empirical Bayesian inference algorithm for solving a linear inverse problem given multiple measurement vectors (MMV) of under-sampled and noisy observable data. Specifically, by exploiting the joint sparsity across the multiple measurements in the sparse domain of the underlying signal or image, we construct a new support informed sparsity promoting prior. Several applications can be modeled using this framework. Our numerical experiments demonstrate that using this new prior not only improves accuracy of the recovery, but also reduces the uncertainty in the posterior when compared to standard sparsity promoting priors.

This is joint work with Theresa Scarnati, formerly of the Air Force Research Lab Wright Patterson and now at Qualis Corporation in Huntsville, AL, and Jack Zhang, a recent bachelor's degree recipient at Dartmouth College now enrolled in the University of Minnesota's PhD program in mathematics.

April 22: Shahar Mendelson (University of Vienna and the Australian National University)

Approximating L_p balls via sampling 

Let X be a centred random vector in R^n. The L_p norms that X endows on R^n are defined by \|v\|_{L_p}= (E|<X,v>|^p)^{1/p}. The goal is to approximate those L_p norms, and the given data consists of N independent sample points X_1,...,X_N distributed as X. More accurately, one would like to construct data-dependent functionals \phi_{p,\epsilon} which satisfy with (very) high probability, that for every v in R^n, (1-\epsilon) \phi_{p,\epsilon}  \leq E|<X,v>|^p \leq (1+\epsilon)  \phi_{p,\epsilon}.

I will show that the functionals \frac{1}{N}\sum_{j \in J} |<X_j,v>|^p are a good choice, where the set of indices J is obtained from \{1,...,N\} by removing the c\epsilon^2 N largest values of |<X_j,v>|. Under mild assumptions on X, only N=(c^p)\epsilon^{-2} n measurements are required, and the probability that the functional performs well is at least 1-2\exp(-c\epsilon^2 N).
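As a quick numerical sanity check (an illustration, not from the talk): for p = 2 and standard Gaussian X we have E|<X,v>|^2 = ||v||^2, so the trimmed functional should land close to the squared norm of v.

```python
import numpy as np

rng = np.random.default_rng(4)

# Trimmed functional phi: average |<X_j, v>|^p after discarding the
# c*eps^2*N largest values (p = 2, Gaussian X, illustrative parameters).
n, N, eps, c = 20, 20_000, 0.1, 1.0
X = rng.standard_normal((N, n))
v = rng.standard_normal(n)

vals = np.abs(X @ v) ** 2
k = int(c * eps**2 * N)                      # number of largest values removed
kept = np.sort(vals)[: N - k]
phi = kept.mean()

ratio = phi / (v @ v)                        # compare to E|<X,v>|^2 = ||v||^2
print(f"phi / E|<X,v>|^2 = {ratio:.3f}")
```

Trimming the top 1% of the chi-squared-distributed values biases the average slightly downward (by roughly 8% here), which is within the (1 ± epsilon)-type multiplicative window the theorem allows.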

April 15: Michael Wakin (Colorado School of Mines)

Spectral Properties of Time-limited Toeplitz Operators and Applications in Signal Processing (Watch the Video)

Toeplitz operators are fundamental and ubiquitous in signal processing and information theory as models for convolutional (filtering) systems. Due to the fact that any practical system can access only signals of finite duration, however, time-limited restrictions of Toeplitz operators are also of interest. In the discrete-time case, time-limited Toeplitz operators are simply Toeplitz matrices. In this talk we survey existing and present new bounds on the eigenvalues (spectra) of time-limited Toeplitz operators, and we discuss applications of these results in various signal processing contexts. As a special case, we discuss time-frequency limiting operators, which alternatingly limit a signal in the time and frequency domains. Slepian functions arise as eigenfunctions of these operators, and we describe applications of Slepian functions in spectral analysis of multiband signals, super-resolution SAR imaging, and blind beamforming in antenna arrays. This talk draws from joint work with numerous collaborators including Zhihui Zhu from the University of Denver.
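The time-frequency limiting behavior described above is concrete in discrete time: the operator is the prolate matrix B[m,n] = sin(2*pi*W*(m-n))/(pi*(m-n)) (diagonal 2W), whose eigenvectors are the Slepian sequences and whose eigenvalues cluster near 0 and 1 with roughly 2NW of them near 1. A short check (standard construction, illustrative N and W):

```python
import numpy as np

# Prolate (time-frequency limiting) matrix for bandwidth W and length N.
N, W = 128, 0.1
m = np.arange(N)
diff = m[:, None] - m[None, :]
B = np.where(diff == 0, 2 * W,
             np.sin(2 * np.pi * W * diff) / (np.pi * np.where(diff == 0, 1, diff)))

eigs = np.linalg.eigvalsh(B)                 # all eigenvalues lie in (0, 1)
near_one = int(np.sum(eigs > 0.5))           # size of the near-unit cluster
print(f"eigenvalues > 1/2: {near_one} (2NW = {2 * N * W:.1f})")
```

The sharp "plunge" between the two clusters is exactly the spectral structure the talk's bounds quantify.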

April 8: Mahdi Soltanolkotabi (University of Southern California)

Precise Tradeoffs for Adversarial Training (Watch the Video)

Despite breakthrough performance, modern learning models are known to be highly vulnerable to small adversarial perturbations in their inputs. While a wide variety of recent adversarial training methods have been effective at improving robustness to perturbed inputs (robust accuracy), often this benefit is accompanied by a decrease in accuracy on benign inputs (standard accuracy), leading to a tradeoff between often competing objectives. Complicating matters further, recent empirical evidence suggests that a variety of other factors (size and quality of training data, model size, etc.) affect this tradeoff in somewhat surprising ways. In this talk we will provide a precise and comprehensive understanding of the role of adversarial training in the context of linear regression with Gaussian features and binary classification in a mixture model. We precisely characterize the standard/robust accuracy and the corresponding tradeoff achieved by a contemporary mini-max adversarial training approach in a high-dimensional regime where the number of data points and the parameters of the model grow in proportion to each other. Our theory for adversarial training algorithms also facilitates the rigorous study of how a variety of factors (size and quality of training data, model overparametrization etc.) affect the tradeoff between these two competing accuracies.

April 1: Yi Ma (University of California, Berkeley)

Deep Networks from First Principles (Watch the Video)

In this talk, we offer an entirely “white box” interpretation of deep (convolutional) networks from the perspective of data compression (and group invariance). In particular, we show how modern deep layered architectures, linear (convolution) operators and nonlinear activations, and even all parameters can be derived from the principle of maximizing rate reduction (with group invariance). All layers, operators, and parameters of the network are explicitly constructed via forward propagation, instead of learned via back propagation. All components of the so-obtained network, called ReduNet, have precise optimization, geometric, and statistical interpretations. There are also several nice surprises from this principled approach: it reveals a fundamental tradeoff between invariance and sparsity for class separability; it reveals a fundamental connection between deep networks and the Fourier transform for group invariance – the computational advantage in the spectral domain (why spiking neurons?); and it clarifies the mathematical role of forward propagation (optimization) and backward propagation (variation). In particular, the so-obtained ReduNet is amenable to fine-tuning via both forward and backward (stochastic) propagation, both optimizing the same objective.

This is joint work with students Yaodong Yu, Ryan Chan, Haozhi Qi of Berkeley, Dr. Chong You now at Google Research, and Professor John Wright of Columbia University. 

March 25: Rachel Ward (University of Texas at Austin)

Function Approximation via Sparse Random Features (Watch the Video)

Random feature methods have been successful in various machine learning tasks, are easy to compute, and come with theoretical accuracy bounds. They serve as an alternative to standard neural networks since they can represent similar function spaces without a costly training phase. However, for accuracy, random feature methods require more measurements than trainable parameters, limiting their use for data-scarce applications or problems in scientific machine learning. This talk introduces the sparse random feature method, which learns parsimonious random feature models using techniques from compressive sensing. We provide uniform bounds on the approximation error for functions in a reproducing kernel Hilbert space depending on the number of samples and the distribution of features. The error bounds improve with additional structural conditions, such as coordinate sparsity, compact clusters of the spectrum, or rapid spectral decay. We show that the sparse random feature method outperforms shallow networks on well-structured functions and in applications to scientific machine learning tasks.
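The basic recipe (draw many random features, then fit with a sparsity-promoting l1 solver) can be sketched as follows; the target function, feature distribution, and ISTA solver here are illustrative choices, not the authors' exact method.

```python
import numpy as np

rng = np.random.default_rng(7)

# Random cosine features phi_j(x) = cos(omega_j * x + b_j), fit with an
# l1-penalized least squares (ISTA) so that few features carry weight.
def features(x, omega, b):
    return np.cos(x[:, None] * omega[None, :] + b[None, :])

Nfeat, Ntrain = 300, 120
omega = rng.standard_normal(Nfeat) * 3.0
b = rng.uniform(0, 2 * np.pi, Nfeat)
x = rng.uniform(-1, 1, Ntrain)
y = np.sin(2 * x) + 0.3 * np.cos(5 * x)      # target function

Phi = features(x, omega, b) / np.sqrt(Nfeat)
L = np.linalg.norm(Phi, 2) ** 2              # Lipschitz constant of the gradient
lam, c = 1e-3, np.zeros(Nfeat)
for _ in range(2000):                        # ISTA: gradient step + soft threshold
    c -= (Phi.T @ (Phi @ c - y)) / L
    c = np.sign(c) * np.maximum(np.abs(c) - lam / L, 0)

err = np.linalg.norm(Phi @ c - y) / np.linalg.norm(y)
print(f"relative training error {err:.3f}, nonzero features {int((c != 0).sum())}/{Nfeat}")
```

With more random features than samples, the l1 penalty selects a parsimonious subset, which is what makes the approach viable in the data-scarce regime the abstract describes.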

March 18: Roberto Imbuzeiro Oliveira (IMPA, Rio de Janeiro)

Sample average approximation with heavier tails (Watch the Video)

Consider an "ideal" optimization problem where constraints and objective function are defined in terms of expectations over some distribution P. The sample average approximation (SAA) -- a fundamental idea in stochastic optimization -- consists of replacing the expectations by an average over a sample from P. A key question is how much the solutions of the SAA differ from those of the original problem. Results by Shapiro from many years ago consider what happens asymptotically when the sample size diverges, especially when the solution of the ideal problem lies on the boundary of the feasible set. In joint work with Philip Thompson (Purdue), we consider what happens with finite samples. As we will see, our results improve upon the nonasymptotic state of the art in various ways: we allow for heavier tails, unbounded feasible sets, and obtain bounds that (in favorable cases) only depend on the geometry of the feasible set in a small neighborhood of the optimal solution. Our results combine "localization" and "fixed-point" type arguments inspired by the work of Mendelson with chaining-type inequalities. One of our contributions is showing what can be said when the SAA constraints are random.

March 11: Jianfeng Cai (Hong Kong University of Science and Technology)

Landscape analysis of non-convex optimizations in phase retrieval (Watch the Video)

Non-convex optimization is a ubiquitous tool in scientific and engineering research. For many important problems, simple non-convex optimization algorithms often provide good solutions efficiently and effectively, despite possible local minima. One way to explain the success of these algorithms is through global landscape analysis. In this talk, we present some results along this direction for phase retrieval. The main results are that, for several non-convex optimizations in phase retrieval, a local minimum is also global and all other critical points have a negative directional curvature. The results not only explain why simple non-convex algorithms usually find a global minimizer for phase retrieval, but are also useful for developing new efficient algorithms with a theoretical guarantee by applying algorithms that are guaranteed to find a local minimum.
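A benign landscape means plain gradient descent on the nonconvex intensity loss should succeed. A minimal Wirtinger-flow-style sketch (spectral initialization plus gradient descent; parameter choices are illustrative, and this is a standard recipe rather than the speaker's specific algorithm):

```python
import numpy as np

rng = np.random.default_rng(6)

# Gradient descent on L(z) = (1/(4m)) * sum_i ((a_i^T z)^2 - y_i)^2,
# with phaseless measurements y_i = (a_i^T x)^2.
n, m = 10, 200
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = (A @ x) ** 2

# Spectral initialization: leading eigenvector of (1/m) sum_i y_i a_i a_i^T.
Y = (A * y[:, None]).T @ A / m
_, V = np.linalg.eigh(Y)
z = V[:, -1] * np.sqrt(y.mean())             # scale estimate ||x|| ~ sqrt(mean y)

lr = 0.1 / y.mean()                          # step size scaled by ||x||^2 estimate
for _ in range(2000):
    r = (A @ z) ** 2 - y                     # residual of the intensities
    z -= lr * (A.T @ (r * (A @ z))) / m      # gradient of L at z

corr = abs(z @ x) / (np.linalg.norm(z) * np.linalg.norm(x))
print(f"alignment |<z,x>| / (||z|| ||x||) = {corr:.4f}")
```

Recovery is only possible up to a global sign, which is why alignment is measured with an absolute value.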

March 4: Ronald DeVore (Texas A&M University) 

Deep Learning and Neural Networks: The Mathematical View (Watch the Video)

Deep Learning is much publicized and has had great empirical success on challenging  problems in learning.  Yet there is no quantifiable proof of performance and certified guarantees for these methods.  This talk will give an overview of Deep Learning from the viewpoint of mathematics and numerical computation.

February 25: Zaiwen Wen (Peking University, China)

Stochastic Second-Order Methods For Deep Learning (Watch the Video)

Stochastic methods are widely used in deep learning. In this talk, we first review state-of-the-art methods such as KFAC. Then we present a structured stochastic quasi-Newton method and a sketchy empirical natural gradient method. Numerical results on deep convolutional networks illustrate that our methods are quite competitive with SGD and KFAC.

February 18: Mikhail Belkin (University of California, San Diego)

A theory of optimization and transition to linearity in deep learning (Watch the Video)

The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. In this talk I will discuss some general mathematical principles allowing for efficient optimization in over-parameterized non-linear systems, a setting that includes deep neural networks. Remarkably, it seems that optimization of such systems is "easy". In particular, optimization problems corresponding to these systems are not convex, even locally, but instead satisfy locally the Polyak-Lojasiewicz (PL) condition allowing for efficient optimization by gradient descent or SGD. We connect the PL condition of these systems to the condition number associated to the tangent kernel and develop a non-linear theory parallel to classical analyses of over-parameterized linear equations.
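For reference, the PL condition mentioned above can be written as follows (a standard textbook formulation, not specific to this talk):

```latex
% Polyak-Lojasiewicz (PL) condition for a loss L with infimum L^*:
% there exists \mu > 0 such that, on the region of interest,
\[
  \tfrac{1}{2}\,\|\nabla L(w)\|^2 \;\ge\; \mu\,\bigl(L(w) - L^*\bigr).
\]
% For a \beta-smooth loss, gradient descent with step size 1/\beta then satisfies
\[
  L(w_t) - L^* \;\le\; \Bigl(1 - \tfrac{\mu}{\beta}\Bigr)^{t}\,\bigl(L(w_0) - L^*\bigr),
\]
% i.e. linear convergence without any convexity assumption.
```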

In a related but conceptually separate development, I will discuss a new perspective on the remarkable recently discovered phenomenon of transition to linearity (constancy of NTK) in certain classes of large neural networks. I will show how this transition to linearity results from the scaling of the Hessian with the size of the network.  

Joint work with Chaoyue Liu and Libin Zhu

February 11: Massimo Fornasier (Technische Universität München)

Consensus-based Optimization on the Sphere (Watch the Video)

I present new stochastic multi-particle models for global optimization of nonconvex functions on the sphere. These models belong to the class of Consensus-Based Optimization methods. In fact, particles move over the manifold driven by a drift towards an instantaneous consensus point, computed as a combination of the particle locations weighted by the cost function according to Laplace's principle. The consensus point represents an approximation to a global minimizer. The dynamics is further perturbed by a random vector field to favor exploration, whose variance is a function of the distance of the particles to the consensus point. In particular, as soon as consensus is reached, the stochastic component vanishes. In the first part of the talk, I present the well-posedness of the model on the sphere and derive rigorously its mean-field approximation in the large-particle limit.

In the second part I address the proof of convergence of numerical schemes to global minimizers, provided the initial datum is well-prepared. The proof combines the mean-field limit with a novel asymptotic analysis, and classical convergence results of numerical methods for SDEs. We present several numerical experiments, which show that the proposed algorithm scales well with the dimension and is extremely versatile. To quantify the performance of the new approach, we show that the algorithm is able to perform essentially as well as ad hoc state-of-the-art methods on challenging problems in signal processing and machine learning, namely the phase retrieval problem and robust subspace detection.

Joint work with H. Huang, L. Pareschi, and P. Sünnen
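The drift-to-consensus mechanism is easy to sketch in the Euclidean setting (the talk treats the sphere; this toy version in R^2 with an illustrative nonconvex objective and made-up parameters only shows the mechanics):

```python
import numpy as np

rng = np.random.default_rng(5)

# Consensus-based optimization in R^2: particles drift toward a
# Laplace-weighted consensus point; noise scales with distance to it,
# so the stochastic term vanishes once consensus is reached.
def f(x):                                     # nonconvex test function
    x_star = np.array([1.0, -2.0])            # location of the quadratic well
    return np.sum((x - x_star) ** 2, axis=-1) + 0.5 * np.sin(5 * x).sum(axis=-1)

J, d, steps = 200, 2, 400
lam, sigma, alpha, dt = 1.0, 0.7, 50.0, 0.05
x = rng.uniform(-3, 3, size=(J, d))           # initial particle cloud

for _ in range(steps):
    w = np.exp(-alpha * (f(x) - f(x).min()))  # Laplace-principle weights
    consensus = (w[:, None] * x).sum(0) / w.sum()
    diff = x - consensus
    x = x - lam * diff * dt + sigma * diff * rng.standard_normal((J, d)) * np.sqrt(dt)

print("consensus point:", np.round(consensus, 2), " f =", float(f(consensus)))
```

As alpha grows, the weighted average concentrates on the best-performing particles (Laplace's principle), which is what steers the consensus point toward a global minimizer.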

February 4: Tino Ullrich (TU Chemnitz)

A New Subsampling Technique for Random Points and Optimal Least Squares Approximation of High-Dimensional Functions (Watch the Video)

We provide a new general upper bound for the minimal L2-worst-case recovery error in the framework of RKHS, where only n function samples are allowed. This quantity can be bounded in terms of the singular numbers of the compact embedding into the space of square integrable functions. It turns out that in many relevant situations this quantity is asymptotically worse only by a factor of sqrt(log n) compared to the singular numbers. The algorithm which realizes this behavior is a weighted least squares algorithm based on a specific set of sampling nodes which works for the whole class of functions simultaneously. These points are constructed from a random draw with respect to a distribution tailored to the spectral properties of the reproducing kernel (importance sampling), in combination with a sub-sampling procedure coming from the celebrated proof of Weaver's conjecture, which was shown to be equivalent to the Kadison-Singer problem. For the above multivariate setting, it is still a fundamental open problem whether sampling algorithms are as powerful as algorithms allowing general linear information like Fourier or wavelet coefficients. However, the gap is now rather small. As a consequence, we may study well-known scenarios where it was widely believed that sparse grid sampling recovery methods perform optimally. It turns out that this is not the case for dimensions d > 2.

This is joint work with N. Nagel and M. Schaefer from TU Chemnitz.
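The importance-sampling idea behind the node construction can be illustrated with a toy weighted least squares fit (the sampling density below is an arbitrary stand-in of mine; in the talk the density is tailored to the kernel's spectral properties, and the Weaver-type subsampling step is omitted here entirely):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200                                   # number of random sample nodes

def basis(x):
    # real trigonometric basis: 1, cos(2*pi*k*x), sin(2*pi*k*x), k = 1..4
    cols = [np.ones_like(x)]
    for k in range(1, 5):
        cols += [np.cos(2*np.pi*k*x), np.sin(2*np.pi*k*x)]
    return np.stack(cols, axis=1)

f = lambda x: np.sin(2*np.pi*x) + 0.5*np.cos(4*np.pi*x)   # lies in the basis

x = rng.beta(2, 1, size=n)                # nodes drawn from density rho(x) = 2x
w = 1.0 / (2.0 * x)                       # weights 1/rho keep the LS consistent

A = basis(x) * np.sqrt(w)[:, None]        # weighted least squares system
c, *_ = np.linalg.lstsq(A, f(x) * np.sqrt(w), rcond=None)

xx = np.linspace(0, 1, 400)
err = np.max(np.abs(basis(xx) @ c - f(xx)))
```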

January 28: Zuowei Shen (National University of Singapore)

Deep Approximation via Deep Learning (Watch the Video)

The primary task of many applications is approximating/estimating a function through samples drawn from a probability distribution on the input space. Deep approximation approximates a function by compositions of many layers of simple functions, which can be viewed as a series of nested feature extractors. The key idea of deep learning networks is to convert layers of compositions into layers of tunable parameters that can be adjusted through a learning process, so that a good approximation with respect to the input data is achieved. In this talk, we shall discuss the mathematical theory behind this new approach and the approximation rate of deep networks; how this new approach differs from classical approximation theory; and how this new theory can be used to understand and design deep learning networks.

January 21: Daniel Kane (University of California, San Diego)

Point Location and Active Learning (Watch the Video)

In the point location problem one is given a hyperplane arrangement and an unknown point. By making linear queries about that point one wants to determine which cell of the hyperplane arrangement it lies in. This problem has an unexpected connection to the problem in machine learning of actively learning a halfspace. We discuss these problems and their relationship and provide a new and nearly optimal algorithm for solving them.
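In one dimension the connection is especially transparent: actively learning a threshold (a halfspace on the line) is binary search, using exponentially fewer label queries than passive sampling. A minimal sketch of my own (names and parameters are illustrative):

```python
def learn_threshold(label, lo=0.0, hi=1.0, eps=1e-6):
    """Actively learn a 1-D halfspace {x : x >= t} by bisection.
    Each call to label() is one query about the unknown point/threshold."""
    queries = 0
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        queries += 1
        if label(mid):          # query: which side of the threshold is mid?
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi), queries

t_true = 0.37219
t_hat, q = learn_threshold(lambda x: x >= t_true)
```

Accuracy eps costs only about log2(1/eps) queries here, roughly 20 for eps = 1e-6; the point-location problem generalizes this to arrangements of many hyperplanes in high dimension, where achieving near-optimal query counts is far subtler.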

January 14: Raymond Chan (City University of Hong Kong)

High-Resolution Phase Reconstruction in Ground-based Astronomy (Watch the Video)

Adaptive optics is a commonly used technique to correct the phase distortions caused by the Earth's atmosphere and thereby improve the image quality of ground-based imaging systems. However, the observed images still suffer from the blur caused by the adaptive optics residual wavefront. We propose a model for reconstructing the residual phase in high resolution from a sequence of low-resolution deformable mirror data. Our model is based on turbulence statistics and the Taylor frozen flow hypothesis. A tomography problem for the phase distortions from different altitudes is solved to get a high-quality phase reconstruction. The associated joint optimization problem is solved by an alternating minimization method, which results in a high-resolution reconstruction algorithm with adaptive wind velocities. Numerical simulations are carried out to show the effectiveness of our approach.

Joint work with Rihuan Ke (University of Cambridge), Roland Wagner, and Ronny Ramlau (RICAM).

January 7, 2021:  Jose Perea (Michigan State University)

Learning functions on the space of persistence diagrams (Watch the Video)

The persistence diagram is an increasingly useful shape descriptor from Topological Data Analysis, but its use alongside typical machine learning techniques requires mathematical finesse.  We will describe in this talk a mathematical framework for featurization of said descriptors, and we show how it addresses the problem of approximating continuous functions on compact subsets of the space of persistence diagrams.  We will also show how these techniques can be applied to problems in semi-supervised learning where these descriptors are relevant.

December 17, 2020:  David P. Woodruff (Carnegie Mellon University)

A Very Sketchy Talk (Watch the Video)

We give an overview of dimensionality reduction methods, or sketching, for a number of problems in optimization, first surveying work using these methods for classical problems, which gives near optimal algorithms for regression, low rank approximation, and natural variants. We then survey recent work applying sketching to column subset selection, kernel methods, sublinear algorithms for structured matrices, tensors, trace estimation, and so on. The focus in the talk will be on fast algorithms.
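The classical sketch-and-solve recipe for least squares regression, covered in the survey portion of the talk, looks as follows (a dense Gaussian sketch for simplicity; the fast algorithms discussed in the talk use structured sketches such as CountSketch or subsampled randomized Hadamard transforms):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5000, 20, 200                 # tall regression problem, sketch size m << n

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

# Gaussian sketch S: a subspace embedding for the column span of [A b]
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)   # solve the small problem
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

# sketched solution is near-optimal for the ORIGINAL problem
rel = np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b)
```

The point is that the expensive n x d solve is replaced by an m x d solve, with the residual provably inflated by only a small factor when m is a modest multiple of d.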

December 10:  Anna Little (University of Utah)

Clustering High-dimensional Data with Path Metrics: A Balance of Density and Geometry (Watch the Video)

This talk discusses multiple methods for clustering high-dimensional data, and explores the delicate balance between utilizing data density and data geometry. I will first present path-based spectral clustering, a novel approach which combines a density-based metric with graph-based clustering. This density-based path metric allows for fast algorithms and strong theoretical guarantees when clusters concentrate around low-dimensional sets. However, the method suffers from a loss of geometric information, information which is preserved by simple linear dimension reduction methods such as classic multidimensional scaling (CMDS). The second part of the talk will explore when CMDS followed by a simple clustering algorithm can exactly recover all cluster labels with high probability. However, scaling conditions become increasingly restrictive as the ambient dimension increases, and the method will fail for irregularly shaped clusters. Finally, I will discuss how a more general family of path metrics, combined with MDS, give low-dimensional embeddings which respect both data density and data geometry. This new method exhibits promising performance on single cell RNA sequence data and can be computed efficiently by restriction to a sparse graph.
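As a concrete instance of a density-based path metric, the limiting member of the family is the longest-leg (bottleneck) path distance, computable for small data sets with a Floyd-Warshall-style recursion. This toy example of mine shows how the metric shrinks within dense clusters while preserving the gap between them:

```python
import numpy as np

def minimax_path_dists(D):
    """All-pairs longest-leg path distances: the cost of a path is its
    largest edge, minimized over all paths (Floyd-Warshall recursion)."""
    P = D.copy()
    for k in range(len(P)):
        P = np.minimum(P, np.maximum(P[:, k:k+1], P[k:k+1, :]))
    return P

rng = np.random.default_rng(0)
# two dense 1-D clusters separated by a gap of width 2
x = np.concatenate([rng.uniform(0.0, 1.0, 30), rng.uniform(3.0, 4.0, 30)])
D = np.abs(x[:, None] - x[None, :])
P = minimax_path_dists(D)

within = P[:30, :30].max()    # small: dense clusters have short hops
across = P[:30, 30:].min()    # large: any path must cross the gap
```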

December 3:  Ding-Xuan Zhou (City University of Hong Kong)

Theory of Deep Convolutional Neural Networks (Watch the Video)

Deep learning has been widely applied and brought breakthroughs in speech recognition, computer vision, natural language processing, and many other domains. The involved deep neural network architectures and computational issues have been well studied in machine learning. But a theoretical foundation for understanding the modelling, approximation, and generalization abilities of deep learning models with these network architectures is still lacking. Here we are interested in deep convolutional neural networks (CNNs). The convolutional architecture gives essential differences between deep CNNs and fully-connected neural networks, and the classical approximation theory for fully-connected networks developed around 30 years ago does not apply. This talk describes an approximation theory of deep CNNs associated with the rectified linear unit (ReLU) activation function. In particular, we prove the universality of deep CNNs, meaning that a deep CNN can be used to approximate any continuous function to arbitrary accuracy when the depth of the neural network is large enough. We also show that deep CNNs perform at least as well as fully-connected neural networks for approximating general functions, and much better for approximating radial functions in high dimensions.

November 26:  Man-Cho Anthony So (Chinese University of Hong Kong)

Non-Convex Orthogonal Group Synchronization via the Generalized Power Method (Watch the Video)

The group synchronization problem calls for the estimation of a ground-truth vector from the noisy relative transforms of its elements, where the elements come from a group and the relative transforms are computed using the binary operation of the group. Such a problem provides an abstraction of a wide range of inverse problems that arise in practice. However, in many instances, one needs to tackle a non-convex optimization formulation. It turns out that for synchronization problems over certain subgroups of the orthogonal group, a simple projected gradient-type algorithm, often referred to as the generalized power method (GPM), is quite effective in finding the ground-truth when applied to their non-convex formulations. In this talk, we survey the related recent results in the literature and focus in particular on the techniques for analyzing the statistical and optimization performance of the GPM.

This talk covers joint works with Huikang Liu, Peng Wang, Man-Chung Yue, and Zirui Zhou.
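For the simplest subgroup, Z_2 = {+1, -1}, the generalized power method reduces to iterating a matrix-vector product followed by an entrywise sign projection onto the group. A minimal sketch (noise model and parameters are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 2.0
x_true = rng.choice([-1.0, 1.0], size=n)          # ground-truth group elements

W = rng.standard_normal((n, n))
W = (W + W.T) / np.sqrt(2)                        # symmetric Gaussian noise
C = np.outer(x_true, x_true) + sigma / np.sqrt(n) * W   # noisy relative signs

# spectral initialization, then generalized power iterations
x = np.sign(np.linalg.eigh(C)[1][:, -1])
x[x == 0] = 1.0
for _ in range(50):
    x = np.sign(C @ x)                            # power step + projection onto {+1,-1}
    x[x == 0] = 1.0

corr = abs(x @ x_true) / n                        # 1.0 means exact recovery (up to global sign)
```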

November 19:  Yonina Eldar (Weizmann Institute of Science)

Deep Analog-to-Digital Compression:  Tasks, Structures, and Models (Watch the Video)

The famous Shannon-Nyquist theorem has become a landmark in the development of digital signal and image processing. However, in many modern applications, signal bandwidths have increased tremendously, while acquisition capabilities have not scaled sufficiently fast. Consequently, conversion to digital has become a serious bottleneck. Furthermore, the resulting digital data requires storage, communication and processing at very high rates, which is computationally expensive and requires large amounts of power. In the context of medical imaging, sampling at high rates often translates to high radiation dosages, increased scanning times, bulky medical devices, and limited resolution.

In this talk, we present a framework for sampling and processing a large class of wideband analog signals at rates far below Nyquist in space, time and frequency, which allows us to dramatically reduce the number of antennas, sampling rates and band occupancy. Our framework relies on exploiting signal structure and the processing task. We consider applications of these concepts to a variety of problems in communications, radar and ultrasound imaging and show several demos of real-time sub-Nyquist prototypes including a wireless ultrasound probe, sub-Nyquist MIMO radar, super-resolution in microscopy and ultrasound, cognitive radio, and joint radar and communication systems. We then discuss how the ideas of exploiting the task, structure and model can be used to develop interpretable model-based deep learning methods that can adapt to existing structure and are trained from small amounts of data. These networks achieve a more favorable trade-off between increase in parameters and data and improvement in performance, while remaining interpretable.

November 12:  Hrushikesh Mhaskar (Claremont Graduate University)

Revisiting the theory of machine learning (Watch the Video)

A central problem of machine learning is the following. Given data of the form (y_i, f(y_i) + ϵ_i)_{i = 1}^M, where the y_i are drawn randomly from an unknown (marginal) distribution μ* and the ϵ_i are random noise variables from another unknown distribution, find an approximation to the unknown function f, and estimate the error in terms of M. The approximation is accomplished typically by neural/RBF/kernel networks, where the number of nonlinear units is determined on the basis of an estimate on the degree of approximation, but the actual approximation is computed using an optimization algorithm. Although this paradigm is obviously extremely successful, we point out a number of perceived theoretical shortcomings of this paradigm, a perception reinforced by some recent observations about deep learning. We describe our efforts to overcome these shortcomings and to develop a more direct and elegant approach based on the principles of approximation theory and harmonic analysis.

October 29:  Yang Wang (Hong Kong University of Science and Technology)

Improving Generative Adversarial Nets (GAN) Using VGrow (Watch the Video)

Generative Adversarial Nets (GANs) have been one of the most exciting developments in machine learning and AI. But training GANs is highly nontrivial. In this talk I will give an introduction to GANs and propose a framework to learn deep generative models via Variational Gradient Flow (VGrow) on probability measure spaces. Connections of our proposed VGrow method with other popular methods, such as VAEs, GANs and flow-based methods, are established in this framework, yielding new insights into deep generative learning.

October 22:  Giuseppe Caire (TU Berlin)

The Mathematics of Massive Random Access Communications (Watch the Video)

The Multiple Access Channel (MAC) is one of the best studied and understood network information theoretic models, describing a scenario where K users wish to deliver their information messages to one receiver, sharing the same transmission channel. Beyond the multitude of practical and somewhat heuristic MAC protocols (e.g., TDMA, FDMA, CDMA, CSMA, Aloha, and variations thereof), the information theoretic capacity region is well understood in many situations of interest, and in particular in the Gaussian case, modeling the uplink of a wireless system with one access point or base station (receiver) and several users sharing the same frequency band.

More recently, a variant of this model has been proposed for a situation where a very large (virtually unlimited) number of users wish to communicate only very sporadically, such that at any point in time only a finite and relatively small number of users are "active". This scenario is appropriate for machine-type communications and the Internet of Things, where a multitude of sensors and objects have only rather sporadic data to send, but need to send it when it is created, at random times.

The identification of the active user set, or the active message set (the list of messages transmitted, irrespective of who is transmitting them), has some points in common with a compressed sensing problem, where the activity vector (entry 1 if a user/message is active and 0 otherwise) is the key object to be estimated at the receiver side.

In this talk we review the basic MAC model and results, a variant for massive random access called "unsourced random access" where all users use the same codebook, and related recent results and algorithms, including some capacity scaling results for this model, which remains quite open as far as a full information theoretic characterization is concerned.
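The activity-detection view can be made concrete with a tiny sparse-recovery sketch: each user has a signature sequence, the receiver sees a noisy superposition from the few active users, and a greedy method such as orthogonal matching pursuit estimates the activity vector. All parameters here are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, m, k = 200, 80, 5                # users, signature length, active users

A = rng.standard_normal((m, n_users)) / np.sqrt(m)   # random signature matrix
active = rng.choice(n_users, size=k, replace=False)
gains = np.zeros(n_users)
gains[active] = 1.0 + rng.uniform(0, 1, k)           # active users' channel gains
y = A @ gains + 0.01 * rng.standard_normal(m)        # superimposed received signal

# orthogonal matching pursuit: greedily grow the estimated active set
support, r = [], y.copy()
for _ in range(k):
    j = int(np.argmax(np.abs(A.T @ r)))              # best-correlated signature
    support.append(j)
    coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
    r = y - A[:, support] @ coef                     # residual after LS refit
```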

October 15:  Reinhard Heckel (Technical University of Munich)

Provable Image Recovery with Untrained Convolutional Neural Networks (Watch the Video)

Convolutional Neural Networks are highly successful tools for image recovery and restoration. A major contributing factor to this success is that convolutional networks impose strong prior assumptions about natural images—so strong that they enable image recovery without any training data. A surprising observation that highlights those prior assumptions is that one can remove noise from a corrupted natural image by simply fitting (via gradient descent) a randomly initialized, over-parameterized convolutional generator to the noisy image. In this talk, we discuss a simple untrained convolutional network, called the deep decoder, that provably enables image denoising and regularization of inverse problems such as compressive sensing with excellent performance. We formally characterize the dynamics of fitting this convolutional network to a noisy signal and to an under-sampled signal, and show that in both cases early-stopped gradient descent provably recovers the clean signal. Finally, we discuss our own numerical results and numerical results from another group demonstrating that untrained convolutional networks enable magnetic resonance imaging from highly under-sampled measurements, achieving results surprisingly close to trained networks, and outperforming classical untrained methods.
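A heavily simplified, linear stand-in for the phenomenon: gradient descent on the coefficients of a fixed smoothing "generator" fits the smooth, clean part of a noisy signal long before it fits the noise, so early stopping denoises. (The actual deep decoder uses upsampling and 1x1 convolutions with nonlinearities; everything below is a toy construction of my own.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
t = np.arange(n) / n
clean = np.sin(2*np.pi*t) + 0.5*np.sin(6*np.pi*t)
noisy = clean + 0.3 * rng.standard_normal(n)

# fixed smooth "generator": circular convolution with a Gaussian kernel
dist = np.minimum(np.arange(n), n - np.arange(n))
kernel = np.exp(-0.5 * (dist / 4.0)**2)
kernel /= kernel.sum()
Fk = np.fft.rfft(kernel)
G = lambda c: np.fft.irfft(Fk * np.fft.rfft(c), n)

c = np.zeros(n)
lr, mse = 1.0, {}
for it in range(1, 3001):
    r = G(c) - noisy
    c -= lr * G(r)      # gradient of 0.5*||G(c) - noisy||^2 (G is symmetric)
    if it in (100, 3000):
        mse[it] = np.mean((G(c) - clean)**2)   # error vs. the CLEAN signal

mse_noisy = np.mean((noisy - clean)**2)
```

Smooth components correspond to large singular values of the generator and are fitted quickly; noise lives mostly in directions with tiny singular values and is fitted slowly, so the early iterate is closer to the clean signal than both the noisy input and the long-run iterate.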

October 8:  Jun Kitagawa (Michigan State University)

Optimal transport with storage fees: theory and numerics (Watch the Video)

In this talk I will discuss the optimal transport problem with "storage fees". This is a variant of the semi-discrete optimal transport (a.k.a. Monge-Kantorovich) problem, where instead of transporting an absolutely continuous measure to a fixed discrete measure and minimizing the transport cost, one must choose the weights of the target measure and minimize the sum of the transport cost and some given "storage fee function" of the target weights. This problem arises in queue penalization and semi-supervised data clustering. I will discuss some basic theoretical questions, such as existence, uniqueness, a dual problem, and characterization of solutions. Then, I will present a numerical algorithm which has global linear and local superlinear convergence for a subclass of storage fee functions. All work in this talk is joint with M. Bansil (UCLA).

October 1:  René Vidal (Johns Hopkins University)

On the Regularization Properties of Structured Dropout (Watch the Video)

Dropout and its extensions (e.g. DropBlock and DropConnect) are popular heuristics for training neural networks, which have been shown to improve generalization performance in practice. However, a theoretical understanding of their optimization and regularization properties remains elusive. This talk will present a theoretical analysis of several dropout-like regularization strategies, all of which can be understood as stochastic gradient descent methods for minimizing a certain regularized loss. In the case of single hidden-layer linear networks, we will show that Dropout and DropBlock induce nuclear norm and spectral k-support norm regularization, respectively, which promote solutions that are low-rank and balanced (i.e. have factors with equal norm). We will also show that the global minimizer for Dropout and DropBlock can be computed in closed form, and that DropConnect is equivalent to Dropout. We will then show that some of these results can be extended to a general class of dropout strategies, and, with some assumptions, to deep non-linear networks when Dropout is applied to the last layer.

September 24:  Rebecca Willett (University of Chicago)

Regularization in Infinite-Width ReLU Networks (Watch the Video)

A growing body of research illustrates that neural network generalization performance is less dependent on the network size (i.e. number of weights or parameters) and more dependent on the magnitude of the weights. That is, generalization is not achieved by limiting the size of the network, but rather by explicitly or implicitly controlling the magnitude of the weights. To better understand this phenomenon, we will explore how neural networks represent functions as the number of weights in the network approaches infinity. Specifically, we characterize the norm required to realize a function f as a single hidden-layer ReLU network with an unbounded number of units (infinite width), but where the Euclidean norm of the weights is bounded, including precisely characterizing which functions can be realized with finite norm. This was settled for univariate functions in Savarese et al. (2019), where it was shown that the required norm is determined by the L1-norm of the second derivative of the function. We extend the characterization to multivariate functions (i.e., networks with d input units), relating the required norm to the L1-norm of the Radon transform of a (d+1)/2-power Laplacian of the function. This characterization allows us to show that all functions in certain Sobolev spaces can be represented with bounded norm and to obtain a depth separation result. These results have important implications for understanding generalization performance and the distinction between neural networks and more traditional kernel learning. 
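For a univariate piecewise-linear example, the Savarese et al. characterization is easy to compute directly: the required norm is (up to a boundary term) the total variation of f', i.e. the sum of absolute slope changes, and it is matched by the obvious ReLU representation. A small sanity check of my own:

```python
import numpy as np

# trapezoid-shaped f: 0 outside [0,3], slope +1 on [0,1], flat on [1,2], slope -1 on [2,3]
knots_x = np.array([0.0, 1.0, 2.0, 3.0])
knots_y = np.array([0.0, 1.0, 1.0, 0.0])

# f' is piecewise constant (0, +1, 0, -1, 0); its total variation = sum of |slope jumps|
slopes = np.concatenate([[0.0], np.diff(knots_y) / np.diff(knots_x), [0.0]])
tv = np.abs(np.diff(slopes)).sum()

# the matching ReLU network: one unit per knot, outer weight = slope jump there
relu = lambda z: np.maximum(z, 0.0)
f_net = lambda x: relu(x) - relu(x - 1) - relu(x - 2) + relu(x - 3)

xx = np.linspace(-1.0, 4.0, 501)
f_true = np.interp(xx, knots_x, knots_y)   # np.interp clamps to 0 outside [0,3]
```

Here the sum of absolute outer weights of the network (1+1+1+1 = 4) equals the total variation of f', illustrating why the minimal-norm infinite-width representation is governed by the second derivative of f as a measure.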

September 17:  Mauro Maggioni (Johns Hopkins University)

Learning Interaction laws in particle- and agent-based systems (Watch the Video)

Interacting agent-based systems are ubiquitous in science, from modeling of particles in Physics to prey-predator and colony models in Biology, to opinion dynamics in economics and social sciences. Oftentimes the laws of interaction between the agents are quite simple, for example they depend only on pairwise interactions, and only on pairwise distance in each interaction. We consider the following inference problem for a system of interacting particles or agents: given only observed trajectories of the agents in the system, can we learn what the laws of interaction are? We would like to do this without assuming any particular form for the interaction laws, i.e. they might be "any" function of pairwise distances. We consider this problem both in the mean-field limit (i.e. the number of particles going to infinity) and in the case of a finite number of agents with an increasing number of observations, albeit in this talk we will mostly focus on the latter case. We cast this as an inverse problem, and study it in the case where the interaction is governed by an (unknown) function of pairwise distances. We discuss when this problem is well-posed, and we construct estimators for the interaction kernels with provably good statistical and computational properties. We measure their performance on various examples, which include extensions to agent systems with different types of agents, second-order systems, and families of systems with parametric interaction kernels. We also conduct numerical experiments to test the large time behavior of these systems, especially in the cases where they exhibit emergent behavior. This is joint work with F. Lu, J. Miller, S. Tang and M. Zhong.
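In the first-order setting the estimation step becomes a linear least squares problem once the kernel is expanded in a basis, because the observed velocities are linear in the kernel. A small self-contained sketch of mine (1-D, monomial basis, illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
N, steps, dt = 20, 200, 0.01
phi_true = lambda r: np.exp(-r)            # the kernel to be recovered

x = rng.uniform(-1.0, 1.0, N)
rows, vels = [], []
for _ in range(steps):
    diff = x[None, :] - x[:, None]         # diff[i, j] = x_j - x_i
    r = np.abs(diff)
    v = (phi_true(r) * diff).mean(axis=1)  # dx_i/dt = mean_j phi(r_ij)(x_j - x_i)
    # features: candidate kernels r^k aggregated exactly as the dynamics do
    rows.append(np.stack([(r**k * diff).mean(axis=1) for k in range(5)], axis=1))
    vels.append(v)
    x = x + dt * v                         # forward-Euler trajectory

F, v = np.vstack(rows), np.concatenate(vels)
coef, *_ = np.linalg.lstsq(F, v, rcond=None)    # quartic estimate of phi
rel_resid = np.linalg.norm(F @ coef - v) / np.linalg.norm(v)
```

The learned polynomial coefficients give an estimate of phi on the range of pairwise distances actually visited by the trajectory, which is exactly the data-adaptive flavor of the estimators in the talk.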

September 10:  Rima Alaifari (ETH Zürich)

The phase factor: recovery from magnitudes of signal representations (Watch the Video)

We study the problem of phase retrieval from a deterministic viewpoint, in which the magnitudes of a time-frequency or time-scale representation of a signal are known. From an inverse problems perspective, the questions of uniqueness and stability are crucial to theoretically guarantee meaningful reconstruction. In this talk, we present results on these two questions for Gabor frames and wavelet frames and conclude by discussing some open problems.

September 3:  Daniel Potts (TU Chemnitz)

High dimensional approximation with trigonometric polynomials (Watch the Video)

In this talk, we present fast Fourier-based methods for the approximation of multivariate functions. Our aim is to learn the support of the Fourier coefficients in the frequency domain of high-dimensional functions. We are interested in two different approximation scenarios. The first case is black-box approximation, where the user is allowed to sample the unknown function at any point; in the second case we are working with fixed scattered data. For black-box approximation we employ quasi-Monte Carlo methods on rank-1 lattice points. The fast algorithms are then based on one-dimensional fast Fourier transforms (FFTs). In the second case, which is much more difficult, we couple truncated ANOVA (analysis of variance) decompositions with the fast Fourier transform for nonequispaced data (NFFT). In both cases, we present error estimates and numerical results. The presented methods can be understood as sparse high-dimensional FFTs.

This talk is based on joint work with Lutz Kämmerer, Michael Schmischke, Manfred Tasche, and Toni Volkmer.
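The rank-1 lattice trick that reduces multivariate transforms to a single 1-D FFT can be verified in a few lines: on lattice nodes x_j = (j/n) z mod 1, each frequency k aliases to the 1-D frequency <k, z> mod n. The generating vector below is a hypothetical choice of mine, not one of the tailored lattices from the talk:

```python
import numpy as np

d, n = 3, 257                                  # dimension, lattice size (prime)
z = np.array([1, 33, 579]) % n                 # hypothetical generating vector

rng = np.random.default_rng(0)
freqs = rng.integers(-8, 9, size=(20, d))      # supported frequency set
coefs = rng.standard_normal(20) + 1j * rng.standard_normal(20)

# rank-1 lattice nodes x_j = (j/n) * z mod 1
j = np.arange(n)
X = (j[:, None] * z[None, :] / n) % 1.0

# direct evaluation of f(x) = sum_k c_k exp(2 pi i <k, x>) at all nodes
direct = (coefs[None, :] * np.exp(2j*np.pi * (X @ freqs.T))).sum(axis=1)

# FFT evaluation: alias each k to <k, z> mod n, then one length-n 1-D transform
a = np.zeros(n, dtype=complex)
np.add.at(a, (freqs @ z) % n, coefs)
fft_vals = n * np.fft.ifft(a)                  # = sum_m a_m exp(2 pi i j m / n)
```

The two evaluations agree exactly; choosing z so that no two frequencies in the support collide under the aliasing map is what makes reconstruction (not just evaluation) possible.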

August 27:  Nir Sochen (University of Tel Aviv)

Unsupervised deep learning of forward and inverse solutions for PDE-based imaging (Watch the Video)

Many imaging modalities are based on inverse problems for physical processes that are described by PDEs. Traditional methods for solving these PDE-based forward and inverse problems rely on discretizations of the domain, while deep learning methods typically require an extensive amount of input-output pairs. Both approaches encounter problems: the former from numerical instabilities and the restriction to low dimensions, the latter from the lack of sufficient data. We suggest an alternative unsupervised deep learning method, where the network parametrizes the solution and the loss function penalizes the deviation from the PDE. The input set consists of points sampled randomly in the domain, and the target output is the deviation from the PDE, namely zero. One key issue in the loss function is the introduction of an L_infty term that guarantees uniform convergence of the network to the solution. We demonstrate our method on Electrical Impedance Tomography (EIT).

August 20:  Helmut Bölcskei (ETH Zürich)

Fundamental limits of learning in deep neural networks (Watch the Video)

We develop a theory that allows us to characterize the fundamental limits of learning in deep neural networks. Concretely, we consider Kolmogorov-optimal approximation through deep neural networks with the guiding theme being a relation between the epsilon-entropy of the hypothesis class to be learned and the complexity of the approximating network in terms of connectivity and memory requirements for storing the network topology and the quantized weights and biases. The theory we develop educes remarkable universality properties of deep networks. Specifically, deep networks can Kolmogorov-optimally learn essentially any hypothesis class. In addition, we find that deep networks provide exponential approximation accuracy—i.e., the approximation error decays exponentially in the number of non-zero weights in the network—for widely different functions including the multiplication operation, polynomials, sinusoidal functions, general smooth functions, and even one-dimensional oscillatory textures and fractal functions such as the Weierstrass function, for neither of which was a method with exponential approximation accuracy previously known. We also show that in the approximation of sufficiently smooth functions finite-width deep networks require strictly smaller connectivity than finite-depth wide networks. We conclude with an outlook on the further role our theory could play.
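The exponential accuracy for the multiplication operation rests on the classical sawtooth construction: piecewise-linear interpolation of x^2, refined by composing a hat function with itself, so that each additional "layer" divides the error by 4. A quick numerical check of the squaring step (the hat function is ReLU-expressible, e.g. g(x) = 2 relu(x) - 4 relu(x - 1/2) + 2 relu(x - 1) on [0,1]):

```python
import numpy as np

def hat(x):
    # g(x) = 2x on [0, 1/2] and 2(1-x) on [1/2, 1]
    return 2.0 * np.minimum(x, 1.0 - x)

def sq_approx(x, m):
    """Depth-m sawtooth approximation of x^2 on [0,1]:
    x^2 ~ x - sum_{s=1}^m g_s(x) / 4^s, with g_s the s-fold composition of hat."""
    out, g = x.copy(), x.copy()
    for s in range(1, m + 1):
        g = hat(g)              # sawtooth with 2^s teeth
        out -= g / 4**s
    return out

m = 8
x = np.linspace(0.0, 1.0, 2001)
err = np.max(np.abs(sq_approx(x, m) - x**2))   # theory: err <= 4**-(m+1)
```

With squaring in hand, multiplication follows from the polarization identity xy = ((x+y)^2 - (x-y)^2)/4, which is how the exponential rate propagates to products and polynomials.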

August 13:  Dustin Mixon (Ohio State)

Ingredients matter: Quick and easy recipes for estimating clusters, manifolds, and epidemics (Watch the Video)

Data science resembles the culinary arts in the sense that better ingredients allow for better results. We consider three instances of this phenomenon. First, we estimate clusters in graphs, and we find that more signal allows for faster estimation. Here, "signal" refers to having more edges within planted communities than across communities. Next, in the context of manifolds, we find that an informative prior allows for estimates of lower error. In particular, we apply the prior that the unknown manifold enjoys a large, unknown symmetry group. Finally, we consider the problem of estimating parameters in epidemiological models, where we find that a certain diversity of data allows one to design estimation algorithms with provable guarantees. In this case, data diversity refers to certain combinatorial features of the social network. Joint work with Jameson Cahill, Charles Clum, Hans Parshall, and Kaiying Xie.

August 6:  Tselil Schramm (MIT)

Reconciling Statistical Queries and the Low Degree Likelihood Ratio (Watch the Video)

In many high-dimensional statistics problems, we observe information-computation tradeoffs: given access to more data, statistical estimation and inference tasks require fewer computational resources. Though this phenomenon is ubiquitous, we lack rigorous evidence that it is inherent. In the current day, to prove that a statistical estimation task is computationally intractable, researchers must prove lower bounds against each type of algorithm, one by one, resulting in a "proliferation of lower bounds". We scientists dream of a more general theory which unifies these lower bounds and explains computational intractability in an algorithm-independent way.

In this talk, I will make one small step towards realizing this dream. I will demonstrate general conditions under which two popular frameworks yield the same information-computation tradeoffs for high-dimensional hypothesis testing: the first being statistical queries in the "SDA" framework, and the second being hypothesis testing with low-degree hypothesis tests, also known as the low-degree-likelihood ratio. Our equivalence theorems capture numerous well-studied high-dimensional learning problems: sparse PCA, tensor PCA, community detection, planted clique, and more.

Based on joint work with Matthew Brennan, Guy Bresler, Samuel B. Hopkins and Jerry Li.

July 30:  Mary Wootters (Stanford University)

Sparse Recovery for Orthogonal Polynomial Transforms (Watch the Video)

Consider the following sparse recovery problem. We have query access to a vector x in R^N such that \hat{x} = Fx is k-sparse (or nearly k-sparse) for some orthogonal transform F.  The goal is to output an approximation to \hat{x} in sublinear time.  This problem has been well-studied in the special case that F is the Discrete Fourier Transform (DFT), and a long line of work has resulted in sparse Fast Fourier Transforms that run in time O(k polylog(N)).  However, for transforms F other than the DFT (or closely related transforms like the Discrete Cosine Transform), the question is much less settled.

In this talk I'll describe sublinear-time algorithms---running in time poly(k log N)---for solving the sparse recovery problem for orthogonal transforms F that arise from orthogonal polynomials.  More precisely, our algorithm works for any F that is an orthogonal polynomial transform derived from Jacobi polynomials.  The Jacobi polynomials are a large class of classical orthogonal polynomials (and include Chebyshev and Legendre polynomials as special cases), and show up extensively in applications like numerical analysis and signal processing.  One caveat of our work is that we require an assumption on the sparsity structure of the sparse vector, although we note that vectors with random support have this property with high probability.

Our approach is to give a very general reduction from the k-sparse sparse recovery problem to the 1-sparse sparse recovery problem, which works for any "flat" orthogonal polynomial transform.

This is joint work with Anna Gilbert, Albert Gu, Chris Re, and Atri Rudra.
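The 1-sparse base case is simple for the Fourier transform itself: if the spectrum has a single nonzero, two queries to x determine it, since the frequency is encoded in the phase ratio of consecutive samples. A noiseless sketch of my own (the Jacobi-polynomial setting of the paper needs a genuinely different, more involved 1-sparse solver):

```python
import numpy as np

N = 1024
f_true, c_true = 337, 2.0 - 1.5j               # one active frequency and coefficient
x = lambda t: c_true * np.exp(2j * np.pi * f_true * np.asarray(t) / N)

# two queries suffice when x-hat is exactly 1-sparse:
# the phase of x(1)/x(0) reveals the frequency, then x(0) gives the coefficient
ratio = x(1) / x(0)
f_hat = int(round(np.angle(ratio) * N / (2 * np.pi))) % N
c_hat = x(0)
```

The k-sparse-to-1-sparse reduction in the talk isolates each heavy frequency (via randomized filtering/binning) so that a solver of this flavor can be run on each bin.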

July 23:  Tamara Kolda (Sandia National Laboratories)

Practical Leverage-Based Sampling for Low-Rank Tensor Decomposition (Watch the Video)

Conventional algorithms for finding low-rank canonical polyadic (CP) tensor decompositions are unwieldy for large sparse tensors. The CP decomposition can be computed by solving a sequence of overdetermined least squares problems with special Khatri-Rao structure. In this work, we present an application of randomized algorithms to fitting the CP decomposition of sparse tensors, solving a significantly smaller sampled least squares problem at each iteration with probabilistic guarantees on the approximation errors. Prior work has shown that sketching is effective in the dense case, but the prior approach cannot be applied to the sparse case because a fast Johnson-Lindenstrauss transform (e.g., using a fast Fourier transform) must be applied in each mode, causing the sparse tensor to become dense. Instead, we perform sketching through leverage score sampling, crucially relying on the fact that the structure of the Khatri-Rao product allows sampling from overestimates of the leverage scores without forming the full product or the corresponding probabilities. Naive application of leverage score sampling is ineffective because we often have cases where a few scores are quite large, so we propose a novel hybrid of deterministic and random leverage-score sampling which consistently yields improved fits. Numerical results on real-world large-scale tensors show the method is significantly faster than competing methods without sacrificing accuracy. This is joint work with Brett Larsen, Stanford University.
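Leverage score sampling for an ordinary overdetermined least squares problem looks as follows (here the exact scores come from a QR factorization; the point of the paper is that for Khatri-Rao structured systems one can sample from cheap overestimates of these scores without ever forming the full matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 4000, 10, 300                     # rows, columns, sample size

A = rng.standard_normal((n, d))
A[: n // 2] *= 0.05                         # heterogeneous rows -> nonuniform leverage
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# leverage scores: squared row norms of an orthonormal basis for range(A)
Q, _ = np.linalg.qr(A)
lev = (Q**2).sum(axis=1)
p = lev / lev.sum()

# sample m rows with probability proportional to leverage, reweight for unbiasedness
idx = rng.choice(n, size=m, p=p)
w = 1.0 / np.sqrt(m * p[idx])
x_s, *_ = np.linalg.lstsq(A[idx] * w[:, None], b[idx] * w, rcond=None)
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)

rel = np.linalg.norm(A @ x_s - b) / np.linalg.norm(A @ x_full - b)
```

The sampled solve touches only m rows yet its residual on the full problem is close to optimal, which is what makes the per-iteration cost of the randomized CP fit so small.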

July 16:  Caroline Uhler (MIT & ETH Zurich)

Multi-Domain Data Integration: From Observations to Mechanistic Insights (Watch the Video)

Massive data collection holds the promise of a better understanding of complex phenomena and, ultimately, of better decisions. An exciting opportunity in this regard stems from the growing availability of perturbation/intervention data (manufacturing, advertisement, education, genomics, etc.). In order to obtain mechanistic insights from such data, a major challenge is the integration of different data modalities (video, audio, interventional, observational, etc.). Using genomics, and in particular the problem of identifying drugs for repurposing against COVID-19, as an example, I will first discuss our recent work on coupling autoencoders in the latent space to integrate and translate between data of very different modalities such as sequencing and imaging. I will then present a framework for integrating observational and interventional data for causal structure discovery and characterize the causal relationships that are identifiable from such data. We end with a theoretical analysis of autoencoders linking overparameterization to memorization. In particular, I will characterize the implicit bias of overparameterized autoencoders and show that such networks trained using standard optimization methods implement associative memory. Collectively, our results have major implications for planning and learning from interventions in various application domains.

July 9:  Holger Rauhut (RWTH Aachen University)

Convergence of gradient flows for learning deep linear neural networks (Watch the Video)

Learning neural networks amounts to minimizing a loss function over given training data.  Often gradient descent algorithms are used for this task, but their convergence properties are not yet well understood.  In order to make progress we consider the simplified setting of linear networks optimized via gradient flows.  We show that such a gradient flow defined with respect to the layers (factors) can, in certain cases, be reinterpreted as a Riemannian gradient flow on the manifold of rank-r matrices.  The gradient flow always converges to a critical point of the underlying loss functional and, for almost all initializations, it converges to a global minimum on the manifold of rank-k matrices for some k.
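The setting can be written out as follows (a sketch of the standard formulation for this line of work, with the notation and the balanced-initialization caveat supplied here for illustration): an N-layer linear network is trained by flowing each factor along the negative gradient of the end-to-end loss.

```latex
% Deep linear network: W = W_N \cdots W_1, loss L^N(W_1,\dots,W_N) = \ell(W_N \cdots W_1).
\begin{aligned}
  \dot W_j(t) &= -\nabla_{W_j}\, L^N\bigl(W_1(t),\dots,W_N(t)\bigr),
      \qquad j = 1,\dots,N, \\[2pt]
  W(t) &= W_N(t)\cdots W_1(t)
      \quad\Longrightarrow\quad
  \dot W(t) = -\operatorname{grad}_{\mathcal{M}_r} \ell\bigl(W(t)\bigr),
\end{aligned}
```

where, under suitable (e.g. balanced) initializations, the product W(t) evolves as a Riemannian gradient flow with respect to the loss \ell on the manifold \mathcal{M}_r of rank-r matrices.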

July 2:  Stéphane Mallat (Collège de France, ENS Paris, Flatiron Institute) 

Descartes versus Bayes: Harmonic Analysis for High Dimensional Learning and Deep Nets  (Watch the Video)

Is high-dimensional learning about function approximation or Bayes probability estimation?  We shall argue that solutions go through finding discriminative variables which concentrate, according to Bayes and statistical physics.  Harmonic analysis gives a mathematical framework to define and analyze such variables from prior information on symmetries.  The results of deep neural network architectures are opening new horizons beyond Fourier, wavelets and sparsity. What is being learned through optimization?  Phase was long forgotten and is making its way back.  This lecture outlines harmonic analysis challenges raised by classification and data generation with deep convolutional neural networks.  We consider applications to image generation and classification, including ImageNet.

June 25:  Richard Baraniuk (Rice University)

Mad Max: Affine Spline Insights into Deep Learning (Watch the Video)

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. Going further, we propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal to each other; this leads to significantly improved classification performance and reduced overfitting with no change to the DN architecture. The spline partition of the input signal space that is implicitly induced by a MASO directly links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up a new geometric avenue to study how DNs organize signals in a hierarchical fashion. To validate the utility of the VQ interpretation, we develop and validate a new distance metric for signals and images that quantifies the difference between their VQ encodings.
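The claim that, conditioned on the input, the network output is an affine transformation of the input can be checked directly on a toy ReLU network (a generic instance of a MASO composition); the weights here are random placeholders, not anything from the talk.

```python
import numpy as np

# A one-hidden-layer ReLU network with random weights.
rng = np.random.default_rng(3)
W1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((3, 8)), rng.standard_normal(3)

def net(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

x = rng.standard_normal(4)
# The activation pattern at x fixes a diagonal 0/1 matrix D, and on the
# region sharing that pattern the network is exactly affine: y = A x + c.
D = np.diag((W1 @ x + b1 > 0).astype(float))
A, c = W2 @ D @ W1, W2 @ D @ b1 + b2
print(np.allclose(net(x), A @ x + c))  # → True
```

The rows of A are precisely the signal-dependent templates mentioned above: the output is their inner products with the input, plus an offset.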

June 18:  Gitta Kutyniok (TU Berlin)

Understanding Deep Neural Networks: From Generalization to Interpretability (Watch the Video)

Deep neural networks have recently seen an impressive comeback with applications both in the public sector and the sciences. However, despite their outstanding success, a comprehensive theoretical foundation of deep neural networks is still missing.

For deriving a theoretical understanding of deep neural networks, one main goal is to analyze their generalization ability, i.e., their performance on unseen data sets. In the case of graph convolutional neural networks, which are heavily used today, for instance, in recommender systems, even the generalization capability to signals on graphs unseen in the training set, typically coined transferability, had not been rigorously analyzed. In this talk, we will prove that spectral graph convolutional neural networks are indeed transferable, thereby also debunking a common misconception about this type of graph convolutional neural network.

If such theoretical approaches fail, or if one is just given a trained neural network without knowledge of how it was trained, interpretability approaches become necessary. Those aim to "break open the black box" in the sense of identifying those features of the input that are most relevant for the observed output. Aiming to derive a theoretically founded approach to this problem, we introduced a novel approach based on rate-distortion theory coined Rate-Distortion Explanation (RDE), which not only provides state-of-the-art explanations, but in addition allows first theoretical insights into the complexity of such problems. In this talk we will discuss this approach and show that it also gives a precise mathematical meaning to the previously vague term of relevant parts of the input.

June 11:  Jelani Nelson (UC Berkeley)

Optimal terminal dimensionality reduction in Euclidean space (Watch the Video)

The Johnson-Lindenstrauss lemma states that for any subset X of R^d with |X| = n and for any epsilon in (0,1), there exists a map f: X --> R^m for m = O(log n / epsilon^2) such that: for all x in X, for all y in X, (1-epsilon)|x - y|_2 <= |f(x) - f(y)|_2 <= (1+epsilon)|x - y|_2.  We show that this statement can be strengthened.  In particular, the above claim holds true even if "for all y in X" is replaced with "for all y in R^d".  Joint work with Shyam Narayanan.
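The classical (non-terminal) statement can be demonstrated with a Gaussian random map; the constant 8 in the target dimension below is illustrative, standing in for the O(·).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, d, eps = 50, 1000, 0.5
m = int(np.ceil(8 * np.log(n) / eps**2))  # m = O(log n / eps^2), independent of d

X = rng.standard_normal((n, d))
G = rng.standard_normal((m, d)) / np.sqrt(m)  # Gaussian JL map f(x) = Gx
Y = X @ G.T

# Distortion of every pairwise distance within X.
ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
          for i, j in combinations(range(n), 2)]
print(min(ratios), max(ratios))  # within [1 - eps, 1 + eps] with high probability
```

The strengthening in the talk asks for more: the same kind of guarantee for distances from each terminal x in X to every point of R^d, which a plain random linear map does not provide.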

June 4:  Ben Adcock (Simon Fraser University)

The troublesome kernel: instabilities in deep learning for inverse problems (Watch the Video)

Due to their stunning success in traditional machine learning applications such as classification, techniques based on deep learning have recently begun to be actively investigated for problems in computational science and engineering. One of the key areas at the forefront of this trend is inverse problems, and specifically, inverse problems in imaging. The last few years have witnessed the emergence of many neural network-based algorithms for important imaging modalities such as MRI and X-ray CT. These claim to achieve competitive, and sometimes even superior, performance to current state-of-the-art techniques.

However, there is a problem. Techniques based on deep learning are typically unstable. For example, small perturbations in the data can lead to a myriad of artifacts in the recovered images. Such artifacts can be hard to dismiss as obviously unphysical, meaning that this phenomenon has potentially serious consequences for the safe deployment of deep learning in practice. In this talk, I will first showcase the instability phenomenon empirically in a range of examples. I will then focus on its mathematical underpinnings, the consequences of these insights when it comes to potential remedies, and the future possibilities for computing genuinely stable neural networks for inverse problems in imaging.

This is joint work with Vegard Antun, Nina M. Gottschling, Anders C. Hansen, Clarice Poon, and Francesco Renna.

Papers:

https://www.pnas.org/content/early/2020/05/08/1907377117

https://arxiv.org/abs/2001.01258

May 28:  Ronald Coifman (Yale)

The Analytic Geometries of Data (Watch the Video)

We will describe methodologies for building data geometries designed to simultaneously analyze and process databases.  The different geometries, or affinity metrics, arise naturally as we learn to contextualize and conceptualize, i.e., to relate data regions and data features (which we extend to data tensors).  Moreover, we generate tensorial multiscale structures.

We will indicate connections to analysis by deep nets and describe applications to modeling observations of dynamical systems, from stochastic molecular dynamics to calcium imaging of brain activity.

May 21:  Daniel Spielman (Yale)

Balancing covariates in randomized experiments using the Gram–Schmidt walk (Watch the Video)

In randomized experiments, such as medical trials, we randomly assign the treatment, such as a drug or a placebo, that each experimental subject receives.  Randomization can help us accurately estimate the difference in treatment effects with high probability.  We also know that we want the two groups to be similar: ideally the two groups would be similar in every statistic we can measure beforehand. Recent advances in algorithmic discrepancy theory allow us to divide subjects into groups with similar statistics.

By exploiting the recent Gram-Schmidt Walk algorithm of Bansal, Dadush, Garg, and Lovett, we can obtain random assignments of low discrepancy.  These allow us to obtain more accurate estimates of treatment effects when the information we measure about the subjects is predictive, while also bounding the worst-case behavior when it is not.

We will explain the experimental design problem we address, the Gram-Schmidt walk algorithm, and the major ideas behind our analyses.  This is joint work with Chris Harshaw, Fredrik Sävje, and Peng Zhang.

Paper: https://arxiv.org/abs/1911.03071

Code: https://github.com/crharshaw/GSWDesign.jl

May 14:  Ilya Razenshteyn (Microsoft Research)

Scalable Nearest Neighbor Search for Optimal Transport (Watch the Video)

The Optimal Transport (aka Wasserstein) distance is an increasingly popular similarity measure for structured data domains, such as images or text documents. This raises the need for fast nearest neighbor search with respect to this distance, a problem that poses a substantial computational bottleneck for various tasks on massive datasets. In this talk, I will discuss fast tree-based approximation algorithms for searching nearest neighbors with respect to the Wasserstein-1 distance. I will start by describing a standard tree-based technique, known as QuadTree, which has been previously shown to obtain good results. Then I'll introduce a variant of this algorithm, called FlowTree, and show that it achieves better accuracy, both in theory and in practice. In particular, the accuracy of FlowTree is in line with previous high-accuracy methods, while its running time is much faster.

The talk is based on joint work with Arturs Backurs, Yihe Dong, Piotr Indyk and Tal Wagner. The paper can be found at https://arxiv.org/abs/1910.04126 and the code at https://github.com/ilyaraz/ot_estimators.

May 7:  Anna Gilbert (University of Michigan)

Metric representations: Algorithms and Geometry (Watch the Video)

Given a set of distances amongst points, determining what metric representation is most “consistent” with the input distances or the metric that best captures the relevant geometric features of the data is a key step in many machine learning algorithms. In this talk, we focus on three specific metric-constrained problems, a class of optimization problems with metric constraints: metric nearness (Brickell et al. (2008)), weighted correlation clustering on general graphs (Bansal et al. (2004)), and metric learning (Bellet et al. (2013); Davis et al. (2007)).

Because of the large number of constraints in these problems, however, these and other researchers have been forced to restrict either the kinds of metrics learned or the size of the problem that can be solved. We provide an algorithm, PROJECT AND FORGET, that uses Bregman projections with cutting planes to solve metric-constrained problems with many (possibly exponentially many) inequality constraints. We also prove that our algorithm converges to the global optimal solution. Additionally, we show that the optimality error decays asymptotically at an exponential rate. We show that using our method we can solve large problem instances of three types of metric-constrained problems, outperforming all state-of-the-art methods with respect to CPU times and problem sizes.
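For the metric nearness problem, the projection step can be illustrated with plain Euclidean (Dykstra-style) cyclic projections onto the triangle-inequality half-spaces. This is only a stand-in: it uses the squared-Euclidean Bregman divergence, enumerates all constraints, and omits the cutting-plane selection and "forget" step that make PROJECT AND FORGET scale.

```python
import numpy as np
from itertools import combinations

def metric_nearness(D, sweeps=50):
    """l2-project a symmetric dissimilarity matrix toward the metric cone.

    Dykstra-style cyclic projections onto the half-spaces
    X[i,j] <= X[i,k] + X[k,j], one per ordered triangle (i, j, k).
    """
    n = D.shape[0]
    X = D.astype(float).copy()
    corr = {}  # Dykstra correction (a scalar multiplier) per constraint
    for _ in range(sweeps):
        for i, j in combinations(range(n), 2):
            for k in range(n):
                if k == i or k == j:
                    continue
                t = corr.get((i, j, k), 0.0)
                # Re-apply the stored correction, then project onto the half-space.
                yij, yik, ykj = X[i, j] + t, X[i, k] - t, X[k, j] - t
                step = max(yij - yik - ykj, 0.0) / 3.0
                X[i, j] = X[j, i] = yij - step
                X[i, k] = X[k, i] = yik + step
                X[k, j] = X[j, k] = ykj + step
                corr[(i, j, k)] = step
    return X

# 9 > 1 + 1 violates the triangle inequality; the nearest metric spreads
# the violation equally across the three entries: (10/3, 20/3, 10/3).
D = np.array([[0., 1., 9.],
              [1., 0., 1.],
              [9., 1., 0.]])
M = metric_nearness(D)
print(M[0, 1], M[0, 2], M[1, 2])
```

Even this toy version touches all O(n^3) triangle constraints per sweep, which is exactly the bottleneck that the cutting-plane and forgetting machinery of the actual algorithm is designed to avoid.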

Finally, we discuss the adaptation of PROJECT AND FORGET to specific types of metric constraints, namely tree and hyperbolic metrics.

April 30:  Thomas Strohmer (UC Davis)

Pandemics, Privacy, and Paradoxes - Why We Need a New Paradigm for Data Science and AI (Watch the Video)

Pioneered by giant internet corporations and powered by machine learning, a new economic system is emerging that pushes for relentless data capture and analysis, usually without users' consent. Surveillance capitalism pursues the exploitation and control of human nature, thereby threatening our social fabric. To counter these developments, we need to rethink the role of data science and artificial intelligence. We must urgently develop a new paradigm of what data is. This urgency is aggravated by the current pandemic, which amplifies fundamental paradoxes underlying data science and AI. I will argue that the key lies in understanding the trialectic nature of data, the careful balance of which will be key to tackling the aforementioned disturbing developments, while still reaping the benefits of data science and AI. Based on this trialectic nature, I will draw consequences for the role of mathematics in data science and indicate how mathematicians can directly contribute to a more just digital revolution.

April 23, 2020:  Andrea Bertozzi (UCLA)  

Epidemic modeling – basics and challenges (Watch the Video)

I will review the basics of epidemic modeling, including exponential growth, compartmental models, and self-exciting point process models.  I will illustrate how such models have been used for previous pandemics and what the challenges are for forecasting the current COVID-19 pandemic.  I will show some examples of fitting such models to data from US states and what one can do with those results.  Overall, model prediction has a degree of uncertainty, especially with early-time data and many unknowns.
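The simplest compartmental model, the SIR system, can be sketched as follows; the transmission and recovery rates here are illustrative values, and the equations are integrated with forward Euler.

```python
def sir(beta=0.3, gamma=0.1, days=160, dt=0.1, i0=1e-3):
    """Forward-Euler integration of the SIR model (fractions of the population).

    ds/dt = -beta*s*i,  di/dt = beta*s*i - gamma*i,  dr/dt = gamma*i.
    """
    s, i, r = 1.0 - i0, i0, 0.0
    traj = []
    for _ in range(int(days / dt)):
        ds = -beta * s * i
        di = beta * s * i - gamma * i
        s, i, r = s + dt * ds, i + dt * di, r + dt * (gamma * i)
        traj.append((s, i, r))
    return traj

traj = sir()
peak = max(i for _, i, _ in traj)
print(round(peak, 3))  # peak infected fraction for R0 = beta/gamma = 3
```

Early on, with s close to 1, the infected compartment grows roughly like exp((beta - gamma) t), which is the exponential-growth regime mentioned above; fitting such curves to early data is exactly where the forecast uncertainty is largest.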