LAST EVENT
We have finally uploaded the videos of the workshop. Please go to each talk's page for details.
DESCRIPTION
Overview
This workshop aims to gather researchers in the area of sequential decision making to discuss recent findings and new challenges around the concept of model misspecification. A misspecified model is one that (1) cannot be tractably solved, (2) when solved, does not produce an acceptable solution for the target problem, or (3) clearly does not describe the available data perfectly. Even though the model has its issues, we are still interested in finding a good policy. The question is thus: how can misspecified models be made to lead to good policies?
We refer to the following (non-exhaustive) types of misspecification.
 States and Context. A misspecified state representation relates to research problems such as Hidden Markov Models, Predictive State Representations, Feature Reinforcement Learning, Partially Observable Markov Decision Processes, etc. The related question of misspecified context in contextual bandits is also relevant.
 Dynamics. Consider learning a policy for a class of several MDPs rather than a single MDP, or optimizing a risk-averse (as opposed to expected) objective. These approaches can yield a reasonable policy for the target MDP even if the model we solve is misspecified. Robustness, safety, and risk-aversion are thus examples of relevant approaches to this question.
 Actions. The underlying insight of working with high-level actions built on top of lower-level actions is that, with the right high-level actions, learning and planning become faster. However, finding an appropriate set of high-level actions can be difficult. One form of model misspecification occurs when the given high-level actions cannot be combined to derive an acceptable policy.
More generally, since misspecification may slow learning or prevent an algorithm from finding any acceptable solution, improving the efficiency of planning and learning methods under misspecification is of primary importance. At another level, all these challenges can benefit greatly from the identification of finer properties of MDPs (local recoverability, etc.) and better notions of complexity. These questions are deeply rooted in theory and in recent applications in fields as diverse as air-traffic control, marketing, and robotics. We thus also want to encourage presentations of challenges that provide a guiding thread and agenda for future research, or a survey of current achievements and difficulties. This includes concrete problems like energy management, smart grids, computational sustainability, and recommender systems.
We welcome contributions on these exciting questions, with the goals of (1) helping close the gap between strong theoretical guarantees and challenging application requirements, (2) identifying promising directions of near-future research, for both applications and theory of sequential decision making, and (3) triggering collaborations amongst researchers on learning good policies despite being given misspecified models.
Motivation, objectives
Despite the success of sequential decision making theory at providing solutions to challenging settings, the field faces a limitation. Often, strong theoretical guarantees depend on the assumption that a solution to the class of models considered is a good solution to the target problem. A popular example is finite-state MDP learning, for which the model of the state space is assumed known. Such an assumption is, however, rarely met in practice. Similarly, in recommender systems and contextual bandits, the context may not capture an accurate summary of the users. Developing a methodology for finding, estimating, and dealing with the limitations of the model is paramount to the success of sequential decision processes. Another example of model misspecification occurs in Hierarchical Reinforcement Learning: in many real-world applications, we could solve the problem easily if we had the right set of high-level actions. Instead, we need to find a way to build those from a cruder set of primitive actions or from existing high-level actions that do not suit the current task.
Yet another application challenge arises when we face a process that can only be modeled as an MDP evolving within some class of MDPs, rather than a fixed MDP, leading to robust reinforcement learning, or when we require safety or risk-averse guarantees.
These problems are important bottlenecks standing in the way of applying sequential decision making to challenging applications, and they motivate the triple goal of this workshop.
Relevance to the community
Misspecification of models (in the senses we consider here) is an important problem faced in many – if not all – real-world applications of sequential decision making under uncertainty. While theoretical results have primarily focused on the case where models of the environment are well-specified, little work has been done on extending the theory to the case of misspecification. Understanding why and when incorrectly specified models lead to good empirical performance, beyond what current theory explains, is also an important goal. We believe this workshop will be of great interest to both theoreticians and applied researchers in the field.
Academic speakers:
 Thomas Dietterich, Oregon State University
 Raphaël Fonteneau, University of Liège.
 Peter Grünwald, Centrum voor Wiskunde en Informatica
 Joelle Pineau, McGill University
 Ronald Ortner, Montanuniversität Leoben
Industry speakers:
 Georgios Theocharous, Adobe Research
 Esteban Arcaute, WalmartLabs
Organizers
 Odalric-Ambrym Maillard, Senior Researcher, The Technion, Israel.
 Timothy Mann, Senior Researcher, The Technion, Israel.
 Shie Mannor, Professor, The Technion, Israel.
 Jeremie Mary, Associate Professor, INRIA Lille - Nord Europe, France.
 Laurent Orseau, Associate Professor, AgroParisTech/INRA, France.
Important Dates
 Workshop call for papers:
 Paper submission deadline: October 9th, 2014 at 11:59 pm PST.
 Notification of acceptance:
 Camera-ready version:
 Date of the workshop (one day): December 12, 2014, Montreal.
Submissions
 The workshop aims to spark vibrant discussion with talks from invited speakers, presentations from authors of accepted papers, and a poster session. We are soliciting three types of contributions:
• Papers (4-6 pages) for oral or interactive poster presentations
• Extended abstracts (2 pages) for interactive poster presentation
• NIPS papers (8+1 pages) accepted at NIPS 2014.
We encourage submissions from different fields of sequential decision making (e.g., reinforcement learning, online learning, active learning), as well as from application-domain experts (e.g., digital marketing, recommender systems, personalized medicine, etc.) addressing the following (non-exhaustive) list of questions and topics:
• Misspecification in model selection.
• State representations in Reinforcement Learning: Hidden Markov Models, Predictive State Representations, Feature Reinforcement Learning, Partially Observable Markov Decision Processes.
• Latent variables in sequential decision making and techniques to handle them.
• Robustness, Safety and Risk-aversion in Reinforcement Learning.
• Curiosity and Autonomous learning (reward misspecification).
• Reinforcement Learning with Options.
• Applications for the Reinforcement Learning community (Computational Sustainability, Smart Cities, Smart Grids, etc.).
• Other topics whose relevance to the workshop is well supported.
Solutions to such challenges will benefit the machine learning community at large, since they also appear in many realworld applications.
 Submitted papers will be evaluated by the PC members. Authors do not need to send an anonymous submission.
 In order to submit an Extended Abstract (2 pages), please send it to bmgp.2page.submissions@gmail.com, with the title of your submission in the subject.
 In order to submit a Paper (4-6 pages), please send it to bmgp.6page.submissions@gmail.com, with the title of your submission in the subject.
 In order to submit an accepted NIPS paper (8+1 pages), please send it to bmgp.nips.submissions@gmail.com, with the title of your submission in the subject.
 If you have any questions about this workshop, you can send an email to nips2014workshopbad_models_to_good_policies@googlegroups.com, with a subject starting with "[Question] ".
SCHEDULE
Friday 12 December 2014 - Level 5: Room 512 c,g
 8:00 - 8:15: Introduction by the organizers
Session 1 (8:15 am - 10:00 am)
Coffee Break and Posters - 30 min (10:00 am - 10:30 am)
Session 2 (10:30 am - 12:00 pm)
LUNCH BREAK (12:00 pm - 2:00 pm)
 2:00 - 2:05: Program highlights by the organizers
Session 3 (2:00 pm - 4:25 pm)
Coffee Break and Posters - 30 min (4:25 pm - 4:55 pm)
Session 4 (4:55 pm - 6:05 pm)
Panel discussion (6:05 pm - 6:40 pm) - From Bad Models to Good Policies
 6:40 - 6:45: Final remarks by the organizers
DETAILS AND ABSTRACTS
Academic Talks
Abstract: "Generalized value iteration can compute confidence intervals on the action values of an MDP based on samples from that MDP. Different confidence interval methods (e.g., the Hoeffding bound, the empirical Bernstein bound, the Weissman et al. L1 multinomial interval, etc.) at each state lead to different confidence intervals throughout the MDP. This talk will address two questions. First, given a strong simulator for an MDP, a fixed policy, and a sampling budget, what is the best way to draw samples in order to obtain the tightest bound on the value of the policy in the start state? That is, what combination of confidence interval method and sampling strategy will give the tightest bounds? Second, how should we draw samples in order to simultaneously optimize the policy and obtain the tightest bounds on the resulting policy? Again, what confidence interval method and what sampling strategy should we use? I will present experiments that suggest partial answers to both questions."

Abstract: "We give an overview of algorithms and results for a reinforcement learning setting where the learner does not have explicit information about the state space of the underlying Markov decision process (MDP). Instead, the learner has a set of models at her disposal that map histories (i.e., observations, chosen actions, and collected rewards) to states. Examples of reinforcement learning in such a setting include reinforcement learning in continuous or large state spaces, where models correspond to aggregations or discretisations of the underlying state space. From the practical point of view, methods such as context trees or probabilistic deterministic finite automata can be considered as models in our sense. This talk is based on joint work with Odalric-Ambrym Maillard and Daniil Ryabko."
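The confidence-interval question in the first abstract above can be made concrete. Below is a minimal sketch of a two-sided Hoeffding interval on a policy's Monte-Carlo value estimate, assuming returns bounded in a known range; the function name and the simulated returns are illustrative, not part of the talk.

```python
import math
import random

def hoeffding_interval(samples, value_range, delta=0.05):
    """Two-sided Hoeffding confidence interval for the mean of bounded samples.

    With probability at least 1 - delta, the true mean lies within
    value_range * sqrt(ln(2/delta) / (2n)) of the sample mean.
    """
    n = len(samples)
    mean = sum(samples) / n
    radius = value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return mean - radius, mean + radius

# Illustrative use: Monte-Carlo returns of a fixed policy drawn from a simulator.
random.seed(0)
returns = [random.uniform(0.0, 1.0) for _ in range(500)]  # returns bounded in [0, 1]
lo, hi = hoeffding_interval(returns, value_range=1.0)
```

Tighter alternatives such as the empirical Bernstein bound replace the fixed range with a variance-dependent term, which is exactly why the choice of interval method matters in the talk's first question.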
Abstract: "Batch mode reinforcement learning is a subclass of reinforcement learning in which the decision-making problem has to be addressed from trajectories only (no model, no simulator, and no additional interactions with the actual system). In this setting, we propose a discussion of a min-max approach to generalization for deterministic problems with continuous state spaces. This approach aims at computing robust policies that account for the fact that the sample of trajectories may be arbitrarily bad. The discussion will be intertwined with the description of a fascinating batch mode reinforcement learning-type problem with trajectories of societies as input, and for which crucial decisions have to be taken well: the energy transition."
Abstract: "Bayesian inference can behave badly if the model under consideration is wrong yet useful: the posterior may fail to concentrate even for large samples, leading to extreme overfitting in practice. We demonstrate this on a simple regression problem. The problem goes away if we make the so-called learning rate small enough, which essentially amounts to making the prior more important and the data less so. Standard Bayes sets the learning rate to 1, which can be too high under model misspecification; in the exponential weights algorithm and PAC-Bayesian inference, close cousins of Bayes popular in the learning theory community, one often sets the learning rate to 1/sqrt(sample size), which is too low if the setting is not adversarial. We introduce the safe Bayesian estimator*, which learns the learning rate from the data. It behaves essentially as well as standard Bayes if the model is correct, but continues to achieve minimax optimal rates with wrong models.
* For an informal introduction to the idea, see Larry Wasserman's blog post <http://normaldeviate.wordpress.com/2012/06/26/selfrepairingbayesianinference/> on 'self-repairing Bayesian inference'."

Abstract: "Predictive state representations (PSRs) offer an expressive framework for
modelling partially observable systems. By compactly representing systems as functions of observable quantities, the PSR learning approach avoids the local-minima-prone expectation-maximization algorithm and instead employs a globally optimal moment-based algorithm. Moreover, since PSRs do not require a predetermined latent state structure as an input, they offer an attractive framework for model-based reinforcement learning when agents must plan without a priori access to a system model. Unfortunately, the expressiveness of PSRs comes with significant computational cost, and this cost is a major factor inhibiting the use of PSRs in applications. In order to alleviate this shortcoming, we introduce the notion of compressed PSRs (CPSRs). The CPSR learning approach combines recent advancements in dimensionality reduction, incremental matrix decomposition, and compressed sensing. We show how this approach provides a principled avenue for learning accurate approximations of PSRs, drastically reducing the computational costs associated with learning while also providing effective regularization. Going further, we propose a planning framework which exploits these learned models. We present empirical results showing that this approach facilitates model learning and planning in large, complex partially observable domains."
Industry Talks
Abstract: "The main objective in the ad recommendation problem is to generate a strategy that selects which offer to present to a user who is visiting a website. Such a policy could be computed using reinforcement learning (RL) technology. An RL policy selects offers based on all the available knowledge about the user at the current visit, so as to maximize the total number of times she will click over multiple visits, also known as the user's lifetime value (LTV). Ad recommendation policies are often built using a batch of data generated by previously deployed strategies of the company. Although there are many techniques for finding a policy given a batch of data, there are few results that guarantee the obtained policy will perform well in the real system without a chance to deploy and execute it. One approach to predicting a policy's performance is to execute it on a model or a simulator of the real world, but models are notoriously hard to learn. In this talk we will present an alternative model-free approach, which computes a lower confidence bound on the expected return of a policy."

Abstract: "Modern business professionals need to master controls embedded in complex systems to achieve their financial goals. In the e-commerce industry, a natural example is paid search advertisement, where the controls provided by search engines are simple to understand but hard to master. From traffic segmentation via keywords (or, more recently, products), to traffic volume and performance management via bids and budgets, business professionals need to achieve high-level aggregate goals, such as return over ad spend (ROAS), while managing fine-grained controls. In this talk, we share our lessons from building a bid-management platform for Adchemy (acquired by @WalmartLabs) that enabled business professionals to worry only about their goals, leaving the fine-grained controls to a learning-based system. Such a system can be cast as a decision-making and optimization problem in a complex, stochastic environment. This requires building an accurate model of the control-to-target-metric relationship. A natural approach is to partition the system into smaller components and build models for each of them. In our example, the environment can be partitioned into five components: impression volume, click-through rate, the bid-to-cost-per-click relationship, conversion rate, and revenue per conversion. Such a partition reduces model complexity and allows building better models for rare events such as user clicks and conversions; it also provides better interpretability and easier diagnosis of the models. Our models, however, typically suffer from misspecification. This is not a primary concern, as we developed a methodology to select models based not only on their goodness of fit, but mainly on the system's ability to manage the users' expectations on the target business metric. Our methodology can be summarized as follows:
1. A bootstrapping-based nonparametric test to determine whether the combination of models (one for each component of the paid-search environment) has predictive power with respect to the final metric of interest.
2. An algorithm for finding a good combination of models from a candidate set of models for each component of the paid-search environment, taking into account the final metric of interest.
3. Expectation setting for the stakeholders, by recommending how long to wait before performance can be measured reliably."
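Step 1 of the methodology above is a bootstrap test of predictive power. As a generic sketch (assuming only that we hold pairs of predicted and observed metric values; this is not the specific test used by the speakers), one can resample the pairs and check how often the composed model's predictions correlate positively with the observed metric:

```python
import random

def bootstrap_predictive_power(preds, actuals, n_boot=2000, seed=0):
    """Nonparametric bootstrap check of predictive power.

    Resamples (pred, actual) pairs with replacement and returns the
    fraction of bootstrap replicates whose sample correlation is
    positive; values near 1 suggest the composed model tracks the
    final metric. Generic sketch, not the Adchemy/@WalmartLabs test.
    """
    rng = random.Random(seed)
    n = len(preds)

    def corr(xs, ys):
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        return cov / ((vx * vy) ** 0.5) if vx > 0 and vy > 0 else 0.0

    positive = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if corr([preds[i] for i in idx], [actuals[i] for i in idx]) > 0:
            positive += 1
    return positive / n_boot

# Hypothetical usage on predicted vs. observed ROAS values:
# power = bootstrap_predictive_power(predicted_roas, observed_roas)
```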
Contributed Talks
Abstract: "We describe how to use robust Markov decision processes for value function approximation with state aggregation. The robustness serves to reduce the sensitivity to the approximation error of suboptimal policies in comparison to classical methods such as fitted value iteration. This results in reducing the bounds on the γ-discounted infinite horizon performance loss by a factor of 1/(1 − γ) while preserving polynomial-time computational complexity. Our experimental results show that using the robust representation can significantly improve the solution quality with minimal additional computational cost."
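The robust backup underlying the abstract above can be sketched minimally, assuming finite uncertainty sets of candidate transition distributions; this is a generic robust-MDP sketch with an illustrative encoding, not the paper's aggregation construction.

```python
def robust_value_iteration(P_sets, R, gamma, n_iter=100):
    """Robust value iteration with finite uncertainty sets.

    P_sets[s][a] is a list of candidate transition distributions (each a
    probability vector over next states); the backup takes the worst case
    over the set, i.e. V(s) = max_a [R(s,a) + gamma * min_P sum_s' P(s') V(s')].
    """
    n_states = len(P_sets)
    V = [0.0] * n_states
    for _ in range(n_iter):
        V = [max(R[s][a] + gamma * min(sum(p[s2] * V[s2] for s2 in range(n_states))
                                       for p in P_sets[s][a])
                 for a in range(len(P_sets[s])))
             for s in range(n_states)]
    return V

# Toy example: state 1 is absorbing with zero reward; from state 0 the
# adversary may either keep us at state 0 or push us into state 1.
P_sets = [[[[1.0, 0.0], [0.0, 1.0]]],   # state 0, one action, two candidate P's
          [[[0.0, 1.0]]]]               # state 1, one action, one candidate P
R = [[1.0], [0.0]]
V = robust_value_iteration(P_sets, R, gamma=0.5)
```

The worst-case inner minimization is what makes the resulting policy insensitive to which candidate model is correct, at the price of conservatism.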
Abstract: "Conditional Value at Risk (CVaR) is a prominent risk measure that is used extensively in various domains. We develop a new formula for the gradient of the CVaR in the form of a conditional expectation. Based on this formula, we propose a novel sampling-based estimator for the CVaR gradient, in the spirit of the likelihood-ratio method. We analyze the bias of the estimator, and prove the convergence of a corresponding stochastic gradient descent algorithm to a local CVaR optimum. Our method allows us to consider CVaR optimization in new domains. As an example, we consider a reinforcement learning application, and learn a risk-sensitive controller for the game of Tetris."
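As background for the abstract above, here is a minimal sketch of the plain empirical CVaR itself, i.e. the quantity being optimized, not the authors' gradient estimator; the function name and sample losses are illustrative.

```python
def empirical_cvar(losses, alpha):
    """Empirical CVaR_alpha: the mean of the worst alpha-fraction of losses.

    This is the standard sample estimator of conditional value at risk
    for a loss distribution, computed by averaging the top-alpha tail.
    """
    k = max(1, int(round(alpha * len(losses))))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k

losses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
cvar_20 = empirical_cvar(losses, alpha=0.2)  # mean of the two worst losses
```

A sampling-based gradient method like the one in the talk would differentiate this tail expectation with respect to policy parameters rather than just evaluate it.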
Abstract: "For Markov decision processes with long horizons (i.e., discount factors close to one), it is common in practice to use reduced horizons during planning to speed computation. However, perhaps surprisingly, when the model available to the agent is estimated from data, as will be the case in most real-world problems, the policy found using a shorter planning horizon can actually be better than a policy learned with the true horizon. The reason, we argue, is that the planning horizon is a complexity control parameter for the class of policies to be learned. We show formally that it has an intuitive, monotonic relationship with a counting measure of complexity. The measure gives rise to a bound on the planning loss, predicting that a planning horizon shorter than the true horizon can reduce overfitting and improve test performance, and we confirm these predictions empirically."
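The planning-horizon idea in the abstract above is easy to state in code: a finite-horizon value iteration in which `horizon` is the complexity-control knob. The tabular MDP encoding below is illustrative, not taken from the paper.

```python
def finite_horizon_values(P, R, horizon, gamma=0.99):
    """Finite-horizon value iteration on a tabular MDP.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a] the
    expected immediate reward. Per the abstract, choosing `horizon` smaller
    than the true (effective) horizon can act as regularization when P and
    R are estimated from data.
    """
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    for _ in range(horizon):
        V = [max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                 for a in range(n_actions))
             for s in range(n_states)]
    return V

# A toy two-state MDP: each action deterministically moves to a fixed state.
P = [[[(1.0, 0)], [(1.0, 1)]],   # from state 0: a0 stays, a1 goes to 1
     [[(1.0, 0)], [(1.0, 1)]]]   # from state 1: a0 goes to 0, a1 stays
R = [[1.0, 0.0], [0.0, 2.0]]
V1 = finite_horizon_values(P, R, horizon=1, gamma=0.9)
V2 = finite_horizon_values(P, R, horizon=2, gamma=0.9)
```

With an estimated (noisy) P and R, a smaller `horizon` limits how far estimation errors compound through the backups, which is the overfitting mechanism the abstract formalizes.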
Abstract: "Bayesian methods suffer from the problem of how to specify prior beliefs. One interesting idea is to consider worst-case priors. This requires solving a stochastic zero-sum game. In this paper, we extend well-known results from bandit theory in order to discover minimax-Bayes policies and discuss when they are practical."
Abstract: "We consider the problems of learning and planning in Markov decision processes with temporally extended actions represented in the options framework. We propose to use predictions about the duration of extended actions to represent the state and show how this leads to a compact predictive state representation model independent of the set of primitive actions. Then we develop a consistent and efficient spectral learning algorithm for such models."
Posters
 Poster 1: Adaptive Management Strategies for White-Nose Syndrome, Nick Meyer, E. Laber, K. Pacifici, B. Reich, J. Drake
 Poster 2: Constrained Stochastic Optimal Control with a Baseline Performance Guarantee, Yinlam Chow, M. Ghavamzadeh
 Poster 3: Generalised Entropy MDPs and Minimax Regret, Emmanouil Androulakis, C. Dimitrakakis
 Poster 4: On Model Accuracy and Planning Horizon, Nan Jiang, A. Kulesza, S. Singh, R. Lewis
 Poster 5: Optimizing the CVaR via Sampling, Aviv Tamar, Y. Glassner, S. Mannor
 Poster 6: Predictive Timing Models, PierreLuc Bacon, B. Balle, D. Precup
 Poster 7: Proactive MDP-based Collision Avoidance Algorithm for Autonomous Car, Denis Osipychev, D. Tran, W. Sheng, G. Chowdhary
 Poster 8: RAAM: The Benefits of Robustness in Approximating Aggregated MDPs in Reinforcement Learning, Marek Petrik, D. Subramanian

