## COST  :  NIPS 2011 Workshop on Computational Trade-offs in Statistical Learning

### Stochastic Algorithms for One-Pass Learning

Leon Bottou, Microsoft Research
The goal of the presentation is to describe practical stochastic gradient algorithms that process each training example only once, yet asymptotically match the performance of the true optimum. This statement needs, of course, to be made more precise. To achieve this, we'll review the works of Nevel'son and Has'minskij (1972), Fabian (1973, 1978), Murata & Amari (1998), Bottou & LeCun (2004), Polyak & Juditsky (1992), Wei Xu (2010), and Bach & Moulines (2011). We will then show how these ideas lead to practical algorithms and new challenges.

### Using More Data to Speed-up Training Time

Shai-Shalev Shwartz, Hebrew University
Recently, there has been a growing interest in understanding how more data can be leveraged to reduce the required training runtime. I will describe a systematic study of the runtime of learning as a function of the number of available training examples, and underscore the main high-level techniques. In particular, a formal positive result will be presented, showing that even in the unrealizable case, the runtime can decrease exponentially while only requiring a polynomial growth of the number of examples. The construction corresponds to a synthetic learning problem and an interesting open question is if and how the tradeoff can be shown for more natural learning problems. I will spell out several interesting candidates of natural learning problems for which we conjecture that there is a tradeoff between computational and sample complexity.
Based on joint work with Ohad Shamir and Eran Tromer.

### Early stopping for non-parametric regression: An optimal data-dependent stopping rule

Garvesh Raskutti, Martin Wainwright and Bin Yu, U C Berkeley
The goal of non-parametric regression is to estimate an unknown function $f^*$ based on $n$ i.i.d. observations of the form $y_i = f^*(x_i) + w_i$, where $\{w_i\}_{i=1}^n$ are additive noise variables. Simply choosing a function to minimize the least-squares loss $\frac{1}{2 n} \sum_{i=1}^{n}{(y_i - f(x_i))^2}$ will lead to overfitting'', so that various estimators are based on different types of regularization. The \emph{early stopping} strategy is to run an iterative algorithm such as gradient descent for a fixed but finite number of iterations. Early stopping is known to yield estimates with better prediction accuracy than those obtained by running the algorithm for an infinite number of iterations. Although bounds on this prediction error are known for certain function classes and step size choices, the bias-variance tradeoffs for arbitrary reproducing kernel Hilbert spaces (RKHSs) and arbitrary choices of step-sizes have not been well-understood to date. In this paper, we derive upper bounds on both the $L^2(\mathbb{P}_n)$ and $L^2(\mathbb{P})$ error for arbitrary RKHSs, and provide an explicit and easily computable data-dependent stopping rule. In particular, it depends only on the sum of step-sizes and the eigenvalues of the empirical kernel matrix for the RKHS. For Sobolev spaces and finite-rank kernel classes, we show that our stopping rule yields estimates that achieve the statistically optimal rates in a minimax sense.

### Statistical and computational tradeoffs in biclustering

S. Balakrishnan, M. Kolar, A. Rinaldo, A. Singh and L. Wasserman , CMU
We consider the problem of identifying a small sub-matrix of activation in a large noisy matrix. We establish the minimax rate for the problem by showing tight (up to constants) upper and lower bounds on the signal strength needed to identify the sub-matrix. We consider several natural computationally tractable procedures and show that under most parameter scalings they are unable to identify the sub-matrix at the minimax signal strength. While we are unable to directly establish the computational hardness of the problem at the minimax signal strength we discuss connections to some known NP-hard problems and their approximation algorithms.

### Theoretical Basis for More Data Less Work''?

Nati Srebro and Karthik Sridharan, TTI Chicago
We argue that current theory cannot be used to analyze how more data leads to less work, that in-fact for a broad generic class of convex learning problems more data does not lead to less work in the worst case, but in practice, actually more data does lead to less work.

### Anticoncentration regularizers for stochastic combinatorial problems

Shiva Kaul and Geoffrey Gordon, CMU
Statistically optimal estimators often seem difficult to compute. When they are the solution to a combinatorial optimization problem, NP-hardness motivates the use of suboptimal alternatives. For example, the non-convex $\ell_0$ norm is ideal for enforcing sparsity, but is typically overlooked in favor of the convex $\ell_1$ norm. We introduce a new regularizer which is small enough to preserve statistical optimality but large enough to circumvent worst-case computational intractability. This regularizer rounds the objective to a fractional precision and smooths it with a random perturbation. Using this technique, we obtain a combinatorial algorithm for noisy sparsity recovery which runs in polynomial time and requires a minimal amount of data.
Ċ
Alekh Agarwal,
Jan 9, 2012, 11:08 AM
Ċ
Alekh Agarwal,
Nov 30, 2011, 11:27 PM
Ċ
Alekh Agarwal,
Nov 30, 2011, 11:27 PM
Ċ
Alekh Agarwal,
Dec 15, 2011, 3:22 PM
Ċ
Alekh Agarwal,
Nov 30, 2011, 11:27 PM
Ċ
Alekh Agarwal,
Nov 30, 2011, 11:27 PM
Ċ
Alekh Agarwal,
Nov 30, 2011, 11:27 PM
Ċ
Alekh Agarwal,
Nov 30, 2011, 11:28 PM
Ċ
Alekh Agarwal,
Dec 15, 2011, 3:23 PM