COST: NIPS 2011 Workshop on Computational Tradeoffs in Statistical Learning
Sierra Nevada, Spain

Stochastic Algorithms for One-Pass Learning
Leon Bottou, Microsoft Research
The goal of the presentation is to describe practical
stochastic gradient algorithms that process each training example only
once, yet asymptotically match the performance of the true
optimum. This statement needs, of course, to be made more precise. To
achieve this, we'll review the works of Nevel'son and Has'minskij
(1972), Fabian (1973, 1978), Murata & Amari (1998), Bottou & LeCun
(2004), Polyak & Juditsky (1992), Wei Xu (2010), and Bach & Moulines
(2011). We will then show how these ideas lead to practical algorithms
and new challenges. 
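A minimal sketch (our own illustration, not from the talk) of the kind of algorithm discussed here: one-pass stochastic gradient descent with iterate averaging in the style of Polyak & Juditsky (1992), on a synthetic least-squares problem. All problem sizes and step-size choices below are arbitrary.

```python
import numpy as np

# Illustrative one-pass averaged SGD on synthetic least-squares data.
# Each example is processed exactly once; the averaged iterate w_bar
# is the Polyak-Ruppert estimate.
rng = np.random.default_rng(0)
d, n = 5, 20000
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)       # current SGD iterate
w_bar = np.zeros(d)   # running average of the iterates
for t in range(n):
    x_t, y_t = X[t], y[t]
    eta = 0.1 / (1.0 + 0.01 * t) ** 0.75   # slowly decaying step size
    grad = (x_t @ w - y_t) * x_t           # gradient of 0.5 * (x'w - y)^2
    w -= eta * grad
    w_bar += (w - w_bar) / (t + 1)         # online mean of the iterates

print(np.linalg.norm(w_bar - w_true))
```

The averaging step is what allows a one-pass algorithm with a slowly decaying step size to approach the accuracy of the exact optimizer.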
Using More Data to Speed Up Training Time
Shai Shalev-Shwartz, Hebrew University
Recently, there has been a growing interest in understanding how more
data can be leveraged to reduce the required training runtime. I will
describe a systematic study of the runtime of learning as a function
of the number of available training examples, and underscore the main
high-level techniques. In particular, a formal positive result will be
presented, showing that even in the unrealizable case, the runtime can
decrease exponentially while only requiring a polynomial growth of the
number of examples. The construction corresponds to a synthetic
learning problem, and an interesting open question is whether and how the
tradeoff can be shown for more natural learning problems. I will spell
out several interesting candidates of natural learning problems for
which we conjecture that there is a tradeoff between computational and
sample complexity.
Based on joint work with Ohad Shamir and Eran Tromer.

Garvesh Raskutti, Martin Wainwright and Bin Yu, UC Berkeley
The goal of nonparametric regression is to estimate an unknown
function $f^*$ based on $n$ i.i.d. observations of the form $y_i =
f^*(x_i) + w_i$, where $\{w_i\}_{i=1}^n$ are additive noise
variables. Simply choosing a function to minimize the least-squares
loss $\frac{1}{2n} \sum_{i=1}^{n} (y_i - f(x_i))^2$ will lead to
``overfitting'', so that various estimators are based on different
types of regularization. The \emph{early stopping} strategy is to run
an iterative algorithm such as gradient descent for a fixed but finite
number of iterations. Early stopping is known to yield estimates with
better prediction accuracy than those obtained by running the
algorithm for an infinite number of iterations. Although bounds on
this prediction error are known for certain function classes and step
size choices, the bias-variance tradeoffs for arbitrary reproducing
kernel Hilbert spaces (RKHSs) and arbitrary choices of step sizes have
not been well understood to date. In this paper, we derive upper
bounds on both the $L^2(\mathbb{P}_n)$ and $L^2(\mathbb{P})$ error for
arbitrary RKHSs, and provide an explicit and easily computable
data-dependent stopping rule. In particular, it depends only on the
sum of the step sizes and the eigenvalues of the empirical kernel matrix
for the RKHS. For Sobolev spaces and finite-rank kernel classes, we
show that our stopping rule yields estimates that achieve the
statistically optimal rates in a minimax sense.
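A hypothetical toy version of the phenomenon (this is our sketch, not the paper's stopping rule): gradient descent in the RKHS on the kernel least-squares objective, tracking error against the true function. The error is minimized at a finite iteration, after which further iterations fit the noise. The sample size, bandwidth, and noise level below are arbitrary.

```python
import numpy as np

# Early stopping as regularization: functional gradient descent on
# (1/2) sum_i (f(x_i) - y_i)^2 over f in an RKHS, f = sum_j alpha_j k(., x_j).
rng = np.random.default_rng(1)
n = 60
x = np.sort(rng.uniform(0.0, 1.0, n))
f_star = np.sin(2 * np.pi * x)          # true regression function at the x_i
y = f_star + 0.5 * rng.normal(size=n)   # noisy observations

# Gaussian kernel matrix on the design points (bandwidth chosen arbitrarily)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.05 ** 2))

eta = 1.0 / np.linalg.eigvalsh(K).max()  # step size below 2 / lambda_max
alpha = np.zeros(n)
errs = []
for t in range(5000):
    alpha -= eta * (K @ alpha - y)       # RKHS gradient step
    errs.append(np.mean((K @ alpha - f_star) ** 2))

t_best = int(np.argmin(errs))
print(t_best, errs[t_best], errs[-1])    # error is minimized at a finite t
```

Running the loop longer only drives the fit toward the noisy labels, which is why a data-dependent stopping time can act as a regularizer.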

S. Balakrishnan, M. Kolar, A. Rinaldo, A. Singh and L. Wasserman, CMU
We consider the problem of identifying a small
submatrix of activation in a large noisy matrix. We establish the
minimax rate for the problem by showing tight (up to constants) upper
and lower bounds on the signal strength needed to identify the
submatrix. We consider several natural computationally tractable
procedures and show that under most parameter scalings they are unable
to identify the submatrix at the minimax signal strength. While we
are unable to directly establish the computational hardness of the
problem at the minimax signal strength, we discuss connections to some
known NP-hard problems and their approximation algorithms.
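A hypothetical toy version of the setup (sizes, signal level, and the simple row/column-sum procedure below are illustrative, not the paper's analysis): a $k \times k$ submatrix with elevated mean is planted in an $n \times n$ Gaussian noise matrix, and a tractable procedure selects the $k$ rows and columns with the largest sums. At a sufficiently strong signal this simple scan succeeds; the paper's point is that such procedures need more signal than the minimax threshold under most parameter scalings.

```python
import numpy as np

# Plant a k x k submatrix of mean mu inside n x n standard Gaussian noise,
# then locate it with a simple computationally tractable procedure:
# take the k largest row sums and the k largest column sums.
rng = np.random.default_rng(2)
n, k, mu = 200, 20, 4.0
rows = rng.choice(n, k, replace=False)
cols = rng.choice(n, k, replace=False)
M = rng.normal(size=(n, n))
M[np.ix_(rows, cols)] += mu              # planted activation block

row_hat = np.argsort(M.sum(axis=1))[-k:]  # k largest row sums
col_hat = np.argsort(M.sum(axis=0))[-k:]  # k largest column sums

print(len(set(row_hat) & set(rows)), len(set(col_hat) & set(cols)))
```

Each planted row sum is shifted by $k\mu$ against noise of standard deviation $\sqrt{n}$, so the scan works here only because $\mu$ was set well above the minimax level.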

Nati Srebro and Karthik Sridharan, TTI Chicago
We argue that current theory cannot be used to analyze how more
data leads to less work; in fact, for a broad, generic class of
convex learning problems, more data does not lead to less work
in the worst case, even though in practice more data does
lead to less work.

Shiva Kaul and Geoffrey Gordon, CMU
Statistically optimal estimators often seem difficult
to compute. When they are the solution to a combinatorial optimization
problem, NP-hardness motivates the use of suboptimal alternatives. For
example, the nonconvex $\ell_0$ norm is ideal for enforcing sparsity,
but is typically overlooked in favor of the convex $\ell_1$ norm. We
introduce a new regularizer which is small enough to preserve
statistical optimality but large enough to circumvent worst-case
computational intractability. This regularizer rounds the objective to
a fractional precision and smooths it with a random
perturbation. Using this technique, we obtain a combinatorial
algorithm for noisy sparsity recovery which runs in polynomial time
and requires a minimal amount of data.
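A toy illustration (our own, not from the paper) of why the $\ell_0$-ideal estimator is combinatorial: exact best-subset selection by exhaustive search over supports, which is feasible only for tiny dimensions. Problem sizes and the sparsity level are arbitrary.

```python
import itertools

import numpy as np

# Exact l0-constrained least squares ("best subset selection") by brute
# force: try every support of size s and keep the one with lowest RSS.
# The C(d, s) search is what makes the l0 approach combinatorial.
rng = np.random.default_rng(3)
n, d, s = 50, 10, 2
w_true = np.zeros(d)
w_true[[1, 7]] = [2.0, -3.0]             # true sparse signal
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

best_support, best_rss = None, np.inf
for support in itertools.combinations(range(d), s):
    Xs = X[:, support]
    coef = np.linalg.lstsq(Xs, y, rcond=None)[0]
    r = y - Xs @ coef
    rss = r @ r
    if rss < best_rss:
        best_rss, best_support = rss, support

print(best_support)  # expected to recover the true support (1, 7)
```

Here $\binom{10}{2} = 45$ least-squares fits suffice, but the count grows combinatorially with $d$ and $s$, which is the intractability the paper's rounded-and-perturbed regularizer is designed to circumvent while keeping statistical optimality.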


