Multi-Trade-offs in Machine Learning

NIPS-2012 Workshop
Friday, December 7, 2012

Lake Tahoe, Nevada, United States
Room: Harrahs Tahoe C


One of the main practical goals of machine learning is to identify relevant trade-offs in different problems, formalize, and solve them. We have already achieved fairly good progress in addressing individual trade-offs, such as model order selection or exploration-exploitation. In this workshop we would like to focus on problems that involve more than one trade-off simultaneously. We are interested both in practical problems where "multi-trade-offs" arise and in theoretical approaches to their solution. Obviously, many problems in life cannot be reduced to a single trade-off and it is highly important to improve our ability to address multiple trade-offs simultaneously. Below we provide several examples of situations, where multiple trade-offs arise simultaneously. The goal of the examples is to provide a starting point for a discussion, but they are not limiting the scope and any other multi-trade-off problem is welcome to be discussed at the workshop.

Multi-trade-offs arise naturally in interaction between multiple learning systems or when a learning system faces multiple tasks simultaneously; especially when the systems or tasks share common resources, such as CPU time, memory, sensors, robot body, and so on. For a concrete example, imagine a robot riding a bicycle and balancing a pole. Each task individually (cycling and pole balancing) can be modeled as a separate optimization problem, but their solutions have to be coordinated, since they share robot resources and robot body. More generally, each learning system or system component has its own internal trade-offs, which have to be balanced against trade-offs of other systems, whereas shared resources introduce external trade-offs that enforce cooperation. The complexity of interaction can vary from independent systems sharing common resources to systems with various degrees of relation between their inputs and tasks. In multi-agent systems communication between the agents introduces an additional trade-off.

We are also interested in multi-trade-offs that arise within individual systems. For example, model order selection and computational complexity [1], or model order selection and exploration-exploitation [2]. For a specific example of this type of problems, imagine a system for real-time prediction of the location of a ball in table tennis. This system has to balance between at least three objectives that interact in a non-trivial manner: (1) complexity of the model of flight trajectory, (2) statistical reliability of the model, (3) computational requirements. Complex models can potentially provide better predictions, but can also lead to overfitting (trade-off between (1) and (2)) and are computationally more demanding. At the same time, there is also a trade-off between having fast crude predictions or slower, but more precise estimations (trade-off between (3) and (1)+(2)). Despite the complex nature of multi-trade-offs, there is still hope that they can be formulated as convex problems, at least in some situations [3].

[1] Shai Shalev-Shwartz and Nathan Srebro. "SVM Optimization: Inverse Dependence on Training Set Size", ICML, 2008.
[2] Yevgeny Seldin, Peter Auer, François Laviolette, John Shawe-Taylor, and Ronald Ortner. "PAC-Bayesian Analysis of Contextual Bandits", NIPS, 2011.
[3] Andreas Argyriou, Theodoros Evgeniou and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 2008, Volume 73, Number 3.

Call for Contributions

We invite submission of abstracts and open problems to the workshop. Abstracts and open problems should be at most 4 pages long in the NIPS format (appendices are allowed, but the organizers reserve the right to evaluate the submissions based on the first 4 pages only). Submissions should be NOT anonymous. Selected abstracts and open problems will be presented as talks or posters during the workshop. Contributions should be emailed to


Submission Deadline: October 16.
Notification of Acceptance: October 30.


  • Theory and application-oriented contributions are equally welcome.
  • All the submissions should indicate clearly at least two non-trivial trade-offs they are addressing.
  • Submission of previously published work or work under review is allowed, in particular NIPS-2012 submissions. However, for oral presentations preference will be given to novel work or work that was not yet presented elsewhere (for example, recent journal publications or NIPS posters). All double submissions must be clearly declared as such!

Invited Speakers

Shai Shalev-Shwartz, The Hebrew University of Jerusalem
 Jan Peters, Technische Universitaet Darmstadt and Max Planck Institute for Intelligent Systems
Csaba Szepesvári, University of Alberta


Yevgeny Seldin, Queensland University of Technology and University College London
Guy Lever, University College London
John Shawe-Taylor, University College London
Koby Crammer, The Technion
Nicolò Cesa-Bianchi, Università degli Studi di Milano
François Laviolette, Université Laval (Québec)
Gábor Lugosi, Pompeu Fabra University
Peter Bartlett, UC Berkeley and Queensland University of Technology


07:30 - 07:40 Opening remarks
07:40 - 08:35 Invited talk - Csaba Szepesvári - "Tradeoffs in online learning under partial information feedback"
08:35 - 08:55 Contributed talk - Cavallanti & Cesa-Bianchi - "Memory Trade-offs in Online Multitask Classification"
08:55 - 09:25 coffee break
09:25 - 09:35 Spotlight - Germain et al. - "PAC-Bayesian Learning and Domain Adaptation"
09:35 - 10:30 Invited talk - Shai Shalev-Shwartz - "The approximation-estimation-computational tradeoff"
10:30 - 15:30  SKI BREAK
15:30 - 16:25  Invited Talk - Jan Peters
16:25 - 16:45  Contributed talk - Machart et al. - "Optimal Computational Trade-Off of Inexact Proximal Methods"
16:45 - 17:05  Contributed talk - Xu et al. - "Sparse Algorithms are not Stable: A No-free-lunch Theorem"
17:05 - 18:30  coffee break + poster session



Talk Abstracts

Csaba Szepesvári - Tradeoffs in online learning under partial information feedback

How should an online learner choose its actions to trade off between exploration and exploitation to maximize the accuracy of predictions where the choice of actions directly influence what information the learner receives? First, using the abstract framework of partial monitoring, we provide a full answer to this question for any discrete prediction problems: As it turns out, the difficulty at the optimal tradeoff depends on a novel, yet intuitive geometric-algebraic condition. We also discuss tradeoffs and open problems concerning adaptation to benign environments, predictions with side-information, a specific problem when the learner needs to pay for accessing the feature values and the label, and the influence of delays in receiving the feedback.
This is joint work with András Antos Gábor Bartók, Pooria Joulani, András György, Dávid Pál, and Navid Zolghadr.

Giovanni Cavallanti and Nicolò Cesa-Bianchi - Memory Trade Offs in Online Multitask Classification

We investigate online kernel algorithms which simultaneously process multiple classification tasks while a fixed constraint is imposed on the size of their active sets. We provide new insights on the multitask kernel and discuss how it can
be coupled with the RBP algorithm [2] to balance the two seemingly contrasting goals of learning from multiple tasks and keeping a constant memory footprint. Two new projection-based algorithms are introduced showing different trade offs
on how the available memory is managed with respect to the prior information about the learning tasks. Experiments prove that budget multitask online algorithms can efficiently tackle real world learning problems involving multiple tasks.


Pascal Germain, Amaury Habrard, François Laviolette and Emilie Morvant - PAC-Bayesian Learning and Domain Adaptation

In machine learning, Domain Adaptation (DA) arises when the distribution generating the test (target) data differs from the one generating the learning (source) data. It is well known that DA is an hard task even under strong assumptions, among which the covariate-shift where the source and target distributions diverge only in their marginals, i.e. they have the same labeling function. Another popular approach is to consider an hypothesis class that moves closer the two distributions while implying a low-error for both tasks. This is a VC-dim approach that restricts the complexity of an hypothesis class in order to get good generalization. Instead, we propose a PAC-Bayesian approach that seeks for suitable weights to be given to each hypothesis in order to build a majority vote. We prove a new DA bound in the PAC-Bayesian context. This leads us to design the first DA-PAC-Bayesian algorithm based on the minimization of the proposed bound. Doing so, we seek for a ρ-weighted majority vote that takes into account a trade-off between three quantities. The first two quantities being, as usual in the PAC-Bayesian approach, (a) the complexity of the majority vote (measured by a Kullback-Leibler divergence) and (b) its empirical risk (measured by the ρ-average errors on the source sample). The third quantity is (c) the capacity of the majority vote to distinguish some structural difference between the source and target samples.

Shai Shalev-Shwartz - The approximation-estimation-computational tradeoff

When analyzing the error of a learning algorithm, it is common to decompose the error into  approximation error (measuring how well the hypothesis class fits the problem) and estimation error (due to the fact we only receive a finite training set). In practice, we usually pay an additional error, called optimization error, due to the fact that we have a limited computational power. I will describe this triple tradeoff and will demonstrate how more training examples can lead to more efficient learning algorithms.

Jan Peters - TBA

Pierre Machart, Sandrine Anthoine and Luca Baldassarre - Optimal Computational Trade-Off of Inexact Proximal Methods

In this paper, we investigate the trade-off between convergence rate and computational cost when minimizing a composite functional with proximal-gradient methods, which are popular optimisation tools in machine learning. We consider the case when the proximity operator is approximated via an iterative procedure, which yields algorithms with two nested loops. We show that the strategy minimizing the computational cost to reach a desired accuracy in finite time is to keep the number of inner iterations constant, which differs from the strategy indicated by a convergence rate analysis.

Huan Xu, Constantine Caramanis and Shie Mannor - Sparse Algorithms are Not Stable: A No-free-lunch Theorem

We consider two widely used notions in machine learning, namely: sparsity and stability. Both notions are deemed desirable, and are believed to lead to good generalization ability. We show that these two notions contradict each other: a sparse algorithm can not be stable and vice versa. Thus, one has to tradeoff sparsity and stability in designing a learning algorithm. This implies that, in contrast to \ell_2 regularized regression, \ell_1 regularized regression (Lasso) cannot be stable.

Kirill Trapeznikov, Venkatesh Saligrama, and David Castanon - Sequential Decision System Design (Poster)

Conventional problems in machine learning involve finding decision rules (models) that lead to a good prediction on unseen examples. It is well understood that this problem requires trading off model fitting on training data with model complexity. We consider problems where acquiring measurements is itself limited by a budget. Specifically, we consider problems with several sensing modalities, some expensive and some cheap. This setting gives rise to a new tradeoff between prediction error and budget. Such problems arise in many applications including homeland security and medical diagnosis where a sequential decision theoretic approach is desired. The goal in these scenarios is to classify examples with low cost sensors and limit the number of examples for which more expensive or time consuming informative sensor is required. In modern passenger screening systems for explosives detection, a suite of sensors such as X-ray backscatter scanners (cheap and fast), millimeter wave imagers (expensive and low-throughput), magnetometers, video, IR imagers in different bands, and/or physical (human) search are being employed. Such systems must maintain a throughput constraint in order to keep pace with arriving traffic. In clinical diagnosis, doctors use a suite of sensors for detecting and assessing the severity of (breast cancer) mammographic mass lesions (malicious or benign) including genetic markers, CT images from different views, 3-D CT tomographic reconstructions, optical tomography imaging, ultrasound imaging, elastography imaging, manual palpation, and biopsy, among others. Many of these sensors provide imagery input for individual human radiologist scoring. The different sensing modalities have diverse costs, in terms of health risks (radiation exposure) and monetary expense.

In this work, we formalize these tradeoffs in the form of an Empirical Risk Minimization problem. If the underlying joint distributions (sensor modalities and class labels) are known then the problem boils down to a standard Markov Decision Problem (MDP), which can be efficiently solved using Dynamic Programming. However, our scenarios have sensors that produce high-dimensional measurements (images). So, not only are the underlying distributions not known, but also impossible to accurately estimate from limited training data. We develop a novel sequential approach based on empirical estimates of cost-to-go functions coupled with optimizing reject classifiers for deriving stage-by-stage rules.

Link to the paper:

Nishant A. Mehta, Dongryeol Lee and Alexander G. Gray - Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL (Poster)

Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks' empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging.