Friday, December 7, 2012

Room: Harrahs Tahoe C

## Abstract

One of the main practical goals of machine learning is to identify relevant trade-offs in different problems, formalize them, and solve them. We have already made fairly good progress in addressing individual trade-offs, such as model order selection or exploration-exploitation. In this workshop we would like to focus on problems that involve more than one trade-off simultaneously. We are interested both in practical problems where "multi-trade-offs" arise and in theoretical approaches to their solution. Many problems in life cannot be reduced to a single trade-off, and it is highly important to improve our ability to address multiple trade-offs simultaneously. Below we provide several examples of situations where multiple trade-offs arise simultaneously. The examples are intended as a starting point for discussion; they do not limit the scope, and any other multi-trade-off problem is welcome at the workshop.

Multi-trade-offs arise naturally in interactions between multiple learning systems, or when a learning system faces multiple tasks simultaneously, especially when the systems or tasks share common resources, such as CPU time, memory, sensors, a robot body, and so on. For a concrete example, imagine a robot riding a bicycle while balancing a pole. Each task individually (cycling and pole balancing) can be modeled as a separate optimization problem, but their solutions have to be coordinated, since they share the robot's resources and body. More generally, each learning system or system component has its own internal trade-offs, which have to be balanced against the trade-offs of other systems, whereas shared resources introduce external trade-offs that enforce cooperation. The complexity of interaction can vary from independent systems sharing common resources to systems with various degrees of relation between their inputs and tasks.
In multi-agent systems, communication between the agents introduces an additional trade-off. We are also interested in multi-trade-offs that arise within individual systems: for example, model order selection and computational complexity [1], or model order selection and exploration-exploitation [2]. For a specific example of this type of problem, imagine a system for real-time prediction of the location of a ball in table tennis. This system has to balance at least three objectives that interact in a non-trivial manner: (1) the complexity of the model of the flight trajectory, (2) the statistical reliability of the model, and (3) the computational requirements. Complex models can potentially provide better predictions, but can also lead to overfitting (a trade-off between (1) and (2)) and are computationally more demanding. At the same time, there is also a trade-off between having fast crude predictions or slower, but more precise, estimates (a trade-off between (3) and (1)+(2)). Despite the complex nature of multi-trade-offs, there is still hope that they can be formulated as convex problems, at least in some situations [3].

References:

## Call for Contributions

We invite submission of

IMPORTANT DATES
EVALUATION CRITERIA

- Theory and application-oriented contributions are equally welcome.
- All submissions should clearly indicate at least two non-trivial trade-offs they address.
- Submission of previously published work or work under review is allowed, in particular NIPS-2012 submissions. However, for oral presentations preference will be given to novel work, or to work that has not yet been presented elsewhere (for example, recent journal publications or NIPS posters). All double submissions must be clearly declared as such!
## Invited Speakers
Shai Shalev-Shwartz, The Hebrew University of Jerusalem

## Organizers
Yevgeny Seldin, Queensland University of Technology and University College London

## Schedule

07:30 - 07:40 Opening remarks

## Sponsors

## Talk Abstracts
How should an online learner choose its actions to trade off between exploration and exploitation so as to maximize the accuracy of its predictions, when the choice of actions directly influences what information the learner receives? First, using the abstract framework of partial monitoring, we provide a full answer to this question for any discrete prediction problem: as it turns out, the difficulty of the optimal trade-off depends on a novel, yet intuitive, geometric-algebraic condition. We also discuss trade-offs and open problems concerning adaptation to benign environments, prediction with side information, a specific problem in which the learner needs to pay for accessing the feature values and the label, and the influence of delays in receiving the feedback.
We investigate online kernel algorithms which simultaneously process multiple classification tasks while a fixed constraint is imposed on the size of their active sets. We provide new insights on the multitask kernel and discuss how it can
In machine learning, Domain Adaptation (DA) arises when the distribution generating the test (target) data differs from the one generating the learning (source) data. It is well known that DA is a hard task even under strong assumptions, among which is the covariate shift, where the source and target distributions diverge only in their marginals, i.e. they share the same labeling function. Another popular approach is to consider a hypothesis class that brings the two distributions closer together while implying a low error for both tasks. This is a VC-dimension approach that restricts the complexity of a hypothesis class in order to obtain good generalization. Instead, we propose a PAC-Bayesian approach that seeks suitable weights to be given to each hypothesis in order to build a majority vote. We prove a new DA bound in the PAC-Bayesian context. This leads us to design the first DA-PAC-Bayesian algorithm, based on the minimization of the proposed bound. In doing so, we seek a ρ-weighted majority vote that takes into account a trade-off between three quantities. The first two, as usual in the PAC-Bayesian approach, are (a) the complexity of the majority vote (measured by a Kullback-Leibler divergence) and (b) its empirical risk (measured by the ρ-average errors on the source sample). The third quantity is (c) the capacity of the majority vote to distinguish some structural difference between the source and target samples.
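Schematically, and with notation assumed here for illustration rather than taken from the talk itself, an objective trading off the three quantities above could look like:

```latex
\min_{\rho}\;
  \underbrace{\mathrm{KL}(\rho \,\|\, \pi)}_{\text{(a) complexity}}
\;+\; c_{1}\,\underbrace{\widehat{R}_{S}(G_{\rho})}_{\text{(b) empirical source risk}}
\;+\; c_{2}\,\underbrace{\mathrm{dis}_{\rho}(S,T)}_{\text{(c) source--target disagreement}}
```

where $\pi$ is a prior over hypotheses, $G_{\rho}$ is the $\rho$-weighted majority vote, and $c_{1}, c_{2}$ are trade-off constants; the exact form of the bound is given in the paper.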
When analyzing the error of a learning algorithm, it is common to decompose the error into approximation error (measuring how well the hypothesis class fits the problem) and estimation error (due to the fact that we only receive a finite training set). In practice, we usually pay an additional error, called optimization error, due to the fact that we have limited computational power. I will describe this triple trade-off and demonstrate how more training examples can lead to more efficient learning algorithms.
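The triple trade-off above can be sketched as the standard excess-risk decomposition (notation assumed here: $R$ is the true risk, $h^{*}$ the best predictor overall, $h^{*}_{\mathcal{H}}$ the best predictor in the class $\mathcal{H}$, $\hat{h}$ the empirical risk minimizer, and $\tilde{h}$ the hypothesis the solver actually returns):

```latex
R(\tilde{h}) - R(h^{*}) =
  \underbrace{R(h^{*}_{\mathcal{H}}) - R(h^{*})}_{\text{approximation}}
+ \underbrace{R(\hat{h}) - R(h^{*}_{\mathcal{H}})}_{\text{estimation}}
+ \underbrace{R(\tilde{h}) - R(\hat{h})}_{\text{optimization}}
```

Intuitively, more training examples shrink the estimation term, which leaves room to stop the optimizer earlier (a larger optimization term) while keeping the total error fixed.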
In this paper, we investigate the trade-off between convergence rate and computational cost when minimizing a composite functional with proximal-gradient methods, which are popular optimization tools in machine learning. We consider the case when the proximity operator is approximated via an iterative procedure, which yields algorithms with two nested loops. We show that the strategy minimizing the computational cost to reach a desired accuracy in finite time is to keep the number of inner iterations constant, which differs from the strategy suggested by a convergence-rate analysis.
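The two-nested-loop structure can be illustrated with a minimal sketch (not the paper's algorithm): a 1-D composite objective $f(x) + g(x)$ with $f(x) = \tfrac{1}{2}(x-3)^2$ and $g(x) = |x|$, where the proximity operator of $g$ is approximated by a fixed number of subgradient steps in an inner loop. The function names and constants here are illustrative assumptions.

```python
# Inexact proximal-gradient sketch with two nested loops, assuming
# f(x) = 0.5*(x - 3)**2 (smooth part) and g(x) = |x| (nonsmooth part).
# The prox of g is approximated by a constant number of inner iterations,
# mirroring the "keep inner iterations constant" strategy.

def grad_f(x):
    return x - 3.0

def approx_prox(v, lam=1.0, n_inner=20):
    """Approximate prox_g(v) = argmin_u lam*|u| + 0.5*(u - v)**2
    with n_inner subgradient steps (illustrative, not the paper's scheme)."""
    u = v
    for k in range(1, n_inner + 1):
        sub = lam * (1.0 if u > 0 else -1.0 if u < 0 else 0.0) + (u - v)
        u -= sub / k  # diminishing step size
    return u

def inexact_prox_grad(x0=0.0, step=1.0, n_outer=30, n_inner=20):
    x = x0
    for _ in range(n_outer):
        v = x - step * grad_f(x)              # forward (gradient) step
        x = approx_prox(v, n_inner=n_inner)   # inexact backward (prox) step
    return x

x = inexact_prox_grad()
# The exact minimizer of 0.5*(x-3)**2 + |x| is x = 2 (soft-thresholding of 3).
```

The total cost is (outer iterations) x (inner iterations), so for a fixed accuracy target one can trade a slower outer rate against cheaper inner solves; the paper's result says the cost-optimal choice keeps `n_inner` constant.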
We consider two widely used notions in machine learning, namely sparsity and stability. Both notions are deemed desirable and are believed to lead to good generalization ability. We show that these two notions contradict each other: a sparse algorithm cannot be stable, and vice versa. Thus, one has to trade off sparsity against stability when designing a learning algorithm. This implies that, in contrast to \ell_2 regularized regression, \ell_1 regularized regression (Lasso) cannot be stable.
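A toy contrast (my own illustration, not the paper's proof) of why \ell_1 induces sparsity while \ell_2 does not: on a single coefficient, the \ell_1 proximal map (soft-thresholding) sets small values exactly to zero, whereas the \ell_2 map only shrinks them.

```python
# Per-coordinate proximal maps for l1 and l2 penalties (illustrative sketch).

def prox_l1(w, lam):
    """Soft-thresholding: argmin_u lam*|u| + 0.5*(u - w)**2."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def prox_l2(w, lam):
    """Shrinkage: argmin_u 0.5*lam*u**2 + 0.5*(u - w)**2."""
    return w / (1.0 + lam)

weights = [0.05, -0.3, 1.2]
sparse = [prox_l1(w, 0.5) for w in weights]  # small weights become exactly 0
dense  = [prox_l2(w, 0.5) for w in weights]  # every weight stays nonzero
```

The hard zeroing is exactly what makes the \ell_1 solution sensitive to small data perturbations near the threshold, which is the stability side of the trade-off.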
Conventional problems in machine learning involve finding decision rules (models) that lead to good predictions on unseen examples. It is well understood that this problem requires trading off model fit on the training data against model complexity. We consider problems where acquiring measurements is itself limited by a budget. Specifically, we consider problems with several sensing modalities, some expensive and some cheap. This setting gives rise to a new trade-off, between prediction error and budget. Such problems arise in many applications, including homeland security and medical diagnosis, where a sequential decision-theoretic approach is desired. The goal in these scenarios is to classify examples with low-cost sensors and limit the number of examples for which a more expensive or time-consuming informative sensor is required.

In modern passenger screening systems for explosives detection, a suite of sensors is employed, such as X-ray backscatter scanners (cheap and fast), millimeter wave imagers (expensive and low-throughput), magnetometers, video, IR imagers in different bands, and/or physical (human) search. Such systems must maintain a throughput constraint in order to keep pace with arriving traffic. In clinical diagnosis, doctors use a suite of sensors for detecting and assessing the severity of (breast cancer) mammographic mass lesions (malignant or benign), including genetic markers, CT images from different views, 3-D CT tomographic reconstructions, optical tomography imaging, ultrasound imaging, elastography imaging, manual palpation, and biopsy, among others. Many of these sensors provide imagery input for individual human radiologist scoring. The different sensing modalities have diverse costs, in terms of health risks (radiation exposure) and monetary expense. In this work, we formalize these trade-offs in the form of an Empirical Risk Minimization problem.
If the underlying joint distributions (over sensor modalities and class labels) are known, then the problem reduces to a standard Markov Decision Problem (MDP), which can be solved efficiently using dynamic programming. However, our scenarios involve sensors that produce high-dimensional measurements (images), so not only are the underlying distributions unknown, they are also impossible to estimate accurately from limited training data. We develop a novel sequential approach based on empirical estimates of cost-to-go functions, coupled with the optimization of reject classifiers, to derive stage-by-stage decision rules.
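The error/budget trade-off behind the stage-by-stage rules can be sketched with a toy two-stage reject classifier (all numbers and costs below are hypothetical, not from the paper): a cheap sensor decides when it is confident, and otherwise the example is escalated to an expensive sensor.

```python
# Two-stage budgeted classification sketch: each example is a pair
# (p_cheap, true_label), where p_cheap is the cheap sensor's estimated
# probability of class 1. For illustration the expensive sensor is
# assumed to always return the true label.

def two_stage_classify(examples, threshold, cheap_cost=1.0, expensive_cost=10.0):
    """Returns (error_rate, average_sensing_cost) for a confidence threshold."""
    errors, cost = 0, 0.0
    for p, y in examples:
        cost += cheap_cost
        if max(p, 1 - p) >= threshold:   # confident: decide with cheap sensor
            pred = 1 if p >= 0.5 else 0
        else:                            # reject: pay for the expensive stage
            cost += expensive_cost
            pred = y
        errors += (pred != y)
    n = len(examples)
    return errors / n, cost / n

data = [(0.95, 1), (0.10, 0), (0.55, 0), (0.60, 1), (0.48, 1)]
lo_err, lo_cost = two_stage_classify(data, threshold=0.9)  # cautious: low error
hi_err, hi_cost = two_stage_classify(data, threshold=0.5)  # frugal: low cost
```

Sweeping the threshold traces out the error/budget curve; the paper's contribution is learning such reject rules stage by stage from data, via empirical cost-to-go estimates, rather than fixing thresholds by hand.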
Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks' empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging.
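The difference between the two endpoints of the spectrum can be seen on a toy instance (targets and loss chosen here for illustration): a single shared parameter fit to three 1-D tasks with squared loss, once under the mean of the empirical risks and once under their maximum.

```python
# Classical MTL minimizes the mean of per-task empirical risks;
# minimax MTL minimizes their maximum. Toy 1-D illustration with a
# shared parameter w and three hypothetical task targets.

def task_risks(w, targets=(0.0, 1.0, 4.0)):
    """Empirical risk of shared parameter w on each toy task."""
    return [(w - t) ** 2 for t in targets]

def minimize(objective, lo=-10.0, hi=10.0, steps=2000):
    """Crude grid search; enough for a one-dimensional illustration."""
    grid = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    return min(grid, key=objective)

w_mean = minimize(lambda w: sum(task_risks(w)) / 3)  # classical MTL
w_max  = minimize(lambda w: max(task_risks(w)))      # minimax MTL
# w_mean sits at the task average (5/3); w_max at the midpoint of the two
# extreme tasks (2.0), reducing the worst task's risk at the mean's expense.
```

This is the loss-compositional idea in miniature: both objectives are functions of the same vector of empirical risks, and the relaxations in the paper interpolate between these two compositions.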