Research

Variance Reduced Nonconvex Stochastic Optimization for Kernel Learning

Recently, variance reduction techniques have greatly enhanced the performance of stochastic gradient algorithms, providing a new framework for finding first-order critical points of nonconvex problems. Our work considers, for the first time, online kernel learning in the nonconvex setting, and we propose a variance-reduced stochastic optimization algorithm for learning a function belonging to a reproducing kernel Hilbert space (RKHS). In contrast to existing techniques, we use a convex combination of existing biased and unbiased stochastic estimators, which eliminates the need for the double-loop structure typically used for variance reduction. It is well known that non-parametric function approximators provide a structured way to learn nonlinear statistical models; however, their representational complexity grows with the sample size, and in streaming settings it may grow unbounded. It is therefore crucial to achieve statistical accuracy with finite representational complexity. The proposed algorithm addresses this tradeoff: we characterize convergence to a neighbourhood of a critical point in terms of the number of iterations and the model complexity parameter. The proposed nonparametric variance-reduced algorithm reaches a neighbourhood of the critical point in T iterations while keeping the model complexity bounded by running a new hybrid compression algorithm.

[Relevant Work: Asilomar 2022]
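For concreteness, the sketch below illustrates the kind of single-loop hybrid estimator described above: a convex combination of an unbiased stochastic gradient and a biased recursive correction. The finite-dimensional parameter vector, the step size eta, and the mixing weight beta are illustrative stand-ins for this sketch; the paper's algorithm operates on functions in the RKHS and couples the update with compression.

```python
import numpy as np

def hybrid_vr_step(w, w_prev, d_prev, grad_fn, sample, eta=0.01, beta=0.9):
    """One single-loop hybrid variance-reduced update (illustrative sketch).

    The estimate d is a convex combination of an unbiased stochastic gradient
    and a biased recursive correction, so no inner loop or full-gradient
    checkpoint is needed. The parameter vector w, step size eta, and mixing
    weight beta are stand-ins, not the paper's functional (RKHS) iterates.
    """
    g_new = grad_fn(w, sample)        # stochastic gradient at current iterate
    g_old = grad_fn(w_prev, sample)   # same sample, evaluated at previous iterate
    d = beta * g_new + (1.0 - beta) * (d_prev + g_new - g_old)  # hybrid estimator
    return w - eta * d, d             # dictionary compression would follow here

# Toy usage on a streaming least-squares problem (mechanics only; the paper
# targets nonconvex losses and keeps the iterate in an RKHS).
rng = np.random.default_rng(0)
grad = lambda w, s: 2.0 * (w @ s[0] - s[1]) * s[0]   # gradient of (w.x - y)^2
w = w_prev = d = np.zeros(3)
for _ in range(200):
    x = rng.normal(size=3)
    y = x @ np.array([1.0, -2.0, 0.5])
    w_new, d = hybrid_vr_step(w, w_prev, d, grad, (x, y))
    w_prev, w = w, w_new
```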

Near-Optimal Set Point Selection for Kernel Approximation using Submodular Set Cover Approach

Non-parametric function approximators provide a principled way to fit nonlinear statistical models while affording formal performance guarantees. However, they define a statistical representation whose complexity scales with the sample size, since they retain all past samples; for large-scale data, this complexity may grow unbounded. One is thus faced with the question of how to suitably trade off representational complexity against statistical accuracy, which may be addressed with various approximation methods. We consider two approaches from the kernel approximation literature, (a) basis vector projection and (b) Random Fourier Features (RFF), and address two issues still associated with them. First, projection-based kernel learning algorithms are known to yield better performance, but they are limited by the computational cost of the projection step. We therefore work on reducing the size of the basis vector set to improve computational efficiency and thereby make projection-based algorithms more practical. Second, to further improve the learning efficiency of RFF algorithms, we consider a data-dependent approach to feature selection and work on reducing the number of features required, providing better learning accuracy and generalization. We propose an algorithm that, instead of selecting features stochastically, provides a principled, data-dependent way of choosing them, yielding a solution that picks only logarithmically more features than the optimal algorithm. The guarantees of earlier methods concern only the sub-optimality of the selected points, not whether the function's parameterization in terms of past points is close to the optimal set of points of a fixed size. We address both issues through a novel formulation of the set point selection problem as a submodular set cover (SSC) problem and establish theoretical guarantees inherited from submodular optimization theory.

[Relevant Work: ICASSP 2022]
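A minimal sketch of the classical greedy rule for submodular set cover, which underlies the logarithmic-factor guarantee mentioned above, is given below. The coverage function is a generic monotone submodular objective supplied by the caller (e.g., how well a subset of points approximates the kernel); the paper's specific data-dependent objective is not reproduced here.

```python
import numpy as np

def greedy_submodular_cover(num_candidates, coverage, target, tol=1e-8):
    """Classical greedy rule for submodular set cover (illustrative sketch).

    `coverage(S)` is any monotone submodular objective over index subsets;
    greedy selection for submodular cover picks at most a logarithmic factor
    more elements than the optimal cover.
    """
    selected, remaining = [], list(range(num_candidates))
    while coverage(selected) < target - tol and remaining:
        # marginal coverage gain of each remaining candidate
        gains = [coverage(selected + [j]) - coverage(selected) for j in remaining]
        selected.append(remaining.pop(int(np.argmax(gains))))
    return selected

# Toy usage: plain set cover, where coverage(S) is the size of the union.
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
cov = lambda S: len(set().union(*(sets[i] for i in S))) if S else 0
print(greedy_submodular_cover(len(sets), cov, target=6))   # e.g. [0, 2]
```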

Scalable Online Gaussian Process Regression

Gaussian processes provide a framework for nonlinear nonparametric Bayesian inference widely applicable across science and engineering. Unfortunately, their computational burden scales cubically with the training sample size, which in the case that samples arrive in perpetuity, approaches infinity. This issue necessitates approximations for use with streaming data, which to date mostly lack convergence guarantees. Thus, we develop the first online Gaussian process approximation that preserves convergence to the population posterior, i.e., asymptotic posterior consistency, while ameliorating its intractable complexity growth with the sample size. We propose an online compression scheme that, following each a posteriori update, fixes an error neighborhood with respect to the Hellinger metric centered at the current posterior, and greedily tosses out past kernel dictionary elements until its boundary is hit. We call the resulting method Parsimonious Online Gaussian Processes (POG). For diminishing error radius, exact asymptotic consistency is preserved at the cost of unbounded memory in the limit. On the other hand, for constant error radius, POG converges to a neighborhood of the population posterior but with finite memory at worst determined by the metric entropy of the feature space. Experimental results are presented on several nonlinear regression problems, which illuminate the merits of this approach as compared with alternatives that fix the subspace dimension defining the history of past points.

[Relevant Work: Statistics & Computing 2021]
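The following sketch conveys the flavour of the compression step described above: after a posterior update, dictionary elements are greedily discarded as long as the compressed posterior stays within an eps-neighbourhood of the current posterior in Hellinger distance. The callables posterior_from and hellinger are placeholders assumed for this sketch, not the paper's exact routines.

```python
def pog_compress(dictionary, posterior_from, hellinger, eps):
    """Greedy dictionary compression in the spirit of POG (illustrative sketch).

    `dictionary` is the list of retained training pairs, `posterior_from`
    maps a dictionary to a GP posterior object, and `hellinger` measures the
    distance between two posteriors; all three are assumptions of this sketch.
    Elements are removed greedily while the compressed posterior stays within
    an eps-neighbourhood of the current, uncompressed posterior.
    """
    reference = posterior_from(dictionary)
    compressed = list(dictionary)
    while len(compressed) > 1:
        # distance to the reference posterior after dropping each single element
        dists = [hellinger(reference, posterior_from(compressed[:i] + compressed[i + 1:]))
                 for i in range(len(compressed))]
        best = min(range(len(compressed)), key=dists.__getitem__)
        if dists[best] > eps:      # any further removal leaves the eps-ball: stop
            break
        compressed.pop(best)
    return compressed
```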

Distributed Online Kernel Learning under Heterogeneous Network Settings

We consider the distributed empirical risk minimization problem in heterogeneous networks: agents seek to minimize a convex functional that aggregates data across the network, while only having access to their local data streams. We focus on the case where agents seek to estimate a regression function that belongs to a reproducing kernel Hilbert space (RKHS). To incentivize coordination while respecting network heterogeneity, we impose nonlinear proximity constraints. The resulting constrained stochastic optimization problem is solved using the functional variant of the stochastic primal-dual (Arrow-Hurwicz) method, which yields a decentralized algorithm. In order to prevent the model complexity from growing linearly over time, we project the primal iterates onto subspaces greedily constructed from kernel evaluations of agents' local observations. The resulting scheme, dubbed Heterogeneous Adaptive Learning with Kernels (HALK), allows us, for the first time, to characterize the precise trade-off between the optimality gap, constraint violation, and the model complexity. Simulations on a correlated spatio-temporal field estimation problem validate our theoretical results, which are corroborated in practice for networked oceanic sensing buoys estimating temperature and salinity from depth measurements.

[Relevant Work: IEEE TSIPN 2021, Asilomar 2020, IEEE GlobalSIP 2018]
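To give a sense of the primal-dual mechanics behind HALK, the sketch below shows one Arrow-Hurwicz saddle-point step for a single agent, written over a finite-dimensional parameter vector rather than the RKHS function iterate used in the paper; the constraint handles, step sizes, and the omitted kernel-dictionary projection are illustrative assumptions.

```python
import numpy as np

def arrow_hurwicz_step(w_i, lam_i, grad_loss_i, constraint_i, grad_constraint_i,
                       eta=0.05, zeta=0.05):
    """One stochastic saddle-point (Arrow-Hurwicz) update for agent i (sketch).

    w_i is a finite-dimensional stand-in for the agent's RKHS function (HALK
    keeps the primal iterate in the RKHS and then projects it onto a subspace
    spanned by kernel evaluations to bound the model size; that projection is
    omitted here). lam_i holds dual variables for the proximity constraints
    with neighbours. The callables and step sizes eta, zeta are illustrative
    assumptions, not the paper's exact choices.
    """
    # primal descent on the local loss plus dual-weighted constraint gradients
    jac = grad_constraint_i(w_i)                     # shape: (num_constraints, dim)
    w_next = w_i - eta * (grad_loss_i(w_i) + jac.T @ lam_i)
    # dual ascent on the constraint violation, projected onto the nonnegatives
    lam_next = np.maximum(lam_i + zeta * constraint_i(w_next), 0.0)
    return w_next, lam_next
```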