2014 NIPS Workshop "From Bad Models to Good Policies" (Sequential Decision Making under Uncertainty)


We have finally uploaded the videos of the workshop.
Please go to the page of each talk for details.

NIPS Workshop "From Bad Models to Good Policies"

The schedule can be exported using the following link: https://www.google.com/calendar/ical/9eurk1c9hf1m5tu30eq0tfrok0%40group.calendar.google.com/public/basic.ics



This workshop aims to gather researchers in the area of sequential decision making to discuss recent findings and new challenges around the concept of model misspecification. A misspecified model is a model that either (1) cannot be tractably solved, (2) can be solved, but its solution is not acceptable for the target problem, or (3) clearly does not describe the available data perfectly. Even though the model has such issues, we are still interested in finding a good policy. The question is thus: how can misspecified models be made to lead to good policies?

We refer to the following (non-exhaustive) types of misspecification.
  1. States and Context. A misspecified state representation relates to research problems such as Hidden Markov Models, Predictive State Representations, Feature Reinforcement Learning, Partially Observable Markov Decision Processes, etc. The related question of misspecified context in contextual bandits is also relevant.
  2. Dynamics. Consider learning a policy for a class of several MDPs rather than a single MDP, or optimizing a risk-averse (as opposed to expected-value) objective. These approaches can be used to derive a reasonable policy for the target MDP even if the model we solved is misspecified. Thus, robustness, safety, and risk-aversion are examples of relevant approaches to this question.
  3. Actions. The underlying insight of working with high-level actions built on top of lower-level actions is that if we had the right high-level actions, we would have faster learning/planning. However, finding an appropriate set of high-level actions can be difficult. One form of model misspecification occurs when the given high-level actions cannot be combined to derive an acceptable policy.

More generally, since misspecification may slow learning or prevent an algorithm from finding any acceptable solution, improving the efficiency of planning and learning methods under misspecification is of primary importance. At another level, all these challenges can benefit greatly from the identification of finer properties of MDPs (local recoverability, etc.) and better notions of complexity. These questions are deeply rooted in theory and in recent applications in fields as diverse as air-traffic control, marketing, and robotics. We thus also want to encourage presentations of challenges that provide a road map and agenda for future research, or a survey of the current achievements and difficulties. This includes concrete problems like energy management, smart grids, computational sustainability, and recommender systems.

We welcome contributions on these exciting questions, with the goals of (1) helping close the gap between strong theoretical guarantees and challenging application requirements, (2) identifying promising directions of near future research, for both applications and theory of sequential decision making, and (3) triggering collaborations amongst researchers on learning good policies despite being given misspecified models.

    Motivation, objectives

    Despite the success of sequential decision making theory at providing solutions to challenging settings, the field faces a limitation. Often strong theoretical guarantees depend on the assumption that a solution to the class of models considered is a good solution to the target problem. A popular example is that of finite-state MDP learning for which the model of the state-space is assumed known. Such an assumption is however rarely met in practice. Similarly, in recommender systems and contextual bandits, the context may not capture an accurate summary of the users. Developing a methodology for finding, estimating, and dealing with the limitations of the model is paramount to the success of sequential decision processes. Another example of model misspecification occurs in Hierarchical Reinforcement Learning: In many real-world applications, we could solve the problem easily if we had the right set of high-level actions. Instead, we need to find a way to build those from a cruder set of primitive actions or existing high-level actions that do not suit the current task.
    Yet another application challenge arises when the process we face can only be modeled as an MDP evolving within some class of MDPs, rather than a fixed MDP, leading to robust reinforcement learning, or when safety or risk-averse guarantees are called for.

    These problems are important bottlenecks standing in the way of applying sequential decision making to challenging applications, and they motivate the triple goal of this workshop.

    Relevance to the community

    Misspecification of models (in the senses we consider here) is an important problem faced in many – if not all – real-world applications of sequential decision making under uncertainty. While theoretical results have primarily focused on the case where models of the environment are well specified, little work has been done on extending the theory to the case of misspecification. Attempting to understand why and when incorrectly specified models lead to good empirical performance, beyond what current theory explains, is also an important goal. We believe that this workshop will be of great interest to both theoreticians and applied researchers in the field.

    Invited Speakers

    Academic speakers:
    • Thomas Dietterich, Oregon State University   
    • Raphaël Fonteneau, University of Liège.
    • Peter Grünwald, Centrum voor Wiskunde en Informatica   
    • Joelle Pineau, McGill University              
    • Ronald Ortner, Montanuniversität Leoben      
    Industry speakers:
    • Georgios Theocharous, Adobe Research            
    • Esteban Arcaute, WalmartLabs               


    Organizers

    • Odalric-Ambrym Maillard, Senior Researcher, The Technion, Israel.
    • Timothy Mann, Senior Researcher, The Technion, Israel.
    • Shie Mannor, Professor, The Technion, Israel.
    • Jeremie Mary, Associate Professor, INRIA Lille - Nord Europe, France.
    • Laurent Orseau, Associate Professor, AgroParisTech/INRA, France.

    Important Dates

    • Workshop call for paper:
      • August 20th, 2014.
    • Paper submission deadline:
      • October 9th, 2014, at 11:59 pm PST.
    • Notification of acceptance:
      • October 23rd, 2014.
    • Camera-ready version:
      • November 27th, 2014.
    • Date of the workshop (one day):
      • December 12th, 2014, Montreal.


      • The workshop aims to spark vibrant discussion with talks from invited speakers, presentations from authors of accepted papers, and a poster session. We are soliciting three types of contributions:
        • Papers (4-6 pages) for oral or interactive poster presentations
        • Extended abstracts (2 pages) for interactive poster presentation
        • NIPS papers (8+1 pages) accepted at NIPS 2014.
        We encourage submissions from different fields of sequential decision making (e.g., reinforcement learning, online learning, active learning), as well as application-domain experts (from e.g., digital marketing, recommender systems, personalized medicine, etc.) addressing the following (non-exhaustive) list of questions and topics:
        • Misspecification in model selection.
        • State-representations in Reinforcement learning: Hidden Markov Models, Predictive State Representations, Feature Reinforcement Learning, Partially Observable Markov Decision Processes.
        • Latent variables in sequential decision making and techniques to handle them.
        • Robustness, Safety and Risk-aversion in Reinforcement Learning.
        • Curiosity and Autonomous learning (reward misspecification).
        • Reinforcement Learning with Options.
        • Application for the Reinforcement Learning community (Computational Sustainability, Smart Cities, Smart grids, etc.).
        • Other topics whose relevance to the workshop is well supported.
        Solutions to such challenges will benefit the machine learning community at large, since they also appear in many real-world applications.
      • Submitted papers will be evaluated by the PC members. Submissions do not need to be anonymized.
      • In order to submit an Extended Abstract (2 pages), please send it to bmgp.2page.submissions@gmail.com, with the title of your submission in the subject.
      • In order to submit a Paper (4-6 pages), please send it to bmgp.6page.submissions@gmail.com, with the title of your submission in the subject.
      • In order to submit an accepted NIPS paper (8+1 pages), please send it to bmgp.nips.submissions@gmail.com, with the title of your submission in the subject.
      • If you have any questions about this workshop, you can send an email to nips2014-workshop-bad_models_to_good_policies@googlegroups.com, with a subject line starting with "[Question] ".


      Friday 12 December 2014 - Level 5: Room 512 c,g

      • 8:00 - 8:15: Introduction by the organizers

      Session 1 (8:15 am - 10:00 am)

      Coffee Break and Posters - 30 min (10:00 am - 10:30 am)

      Session 2 (10:30 am - 12:00 pm)

      LUNCH BREAK (12:00 pm - 2:00 pm) 

      • 2:00 - 2:05: Program highlights by the organizers

      Session 3  (2:00 pm - 4:25 pm)

      Short break (10 minutes)

      Coffee Break and Posters  - 30 min (4:25 pm - 4:55 pm)

      Session 4  (4:55 pm - 6:05 pm)

      Panel discussion (6:05 pm - 6:40 pm) - From Bad Models to Good Policies

      • 6:40- 6:45: Final remarks by the organizers


      Academic Talks

          Abstract: "Generalized value iteration can compute confidence intervals on the action values of an MDP based on samples from that MDP. Different confidence interval methods (e.g., the Hoeffding bound, the Empirical Bernstein bound, the Weissman et al. L1 multinomial interval, etc.) at each state lead to different confidence intervals throughout the MDP. This talk will address two questions. First, given a strong simulator for an MDP, a fixed policy, and a sampling budget, what is the best way to draw samples in order to obtain the tightest bound on the value of the policy in the start state? That is, what combination of confidence interval method and sampling strategy will give the tightest bounds? Second, how should we draw samples in order to simultaneously optimize the policy and obtain the tightest bounds on the resulting policy? Again, what confidence interval method and what sampling strategy should we use? I will present experiments that suggest partial answers to both questions."
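As an illustrative aside (our sketch, not the speaker's actual method): the simplest of the confidence-interval methods mentioned, a Hoeffding bound, can be applied to Monte Carlo estimates of a policy's value, assuming each sampled return lies in a known range [0, v_max]. The function name and toy data below are invented for illustration.

```python
import math
import random

def hoeffding_value_bounds(returns, v_max, delta=0.05):
    """Two-sided Hoeffding confidence interval on the expected return,
    assuming each sampled return is bounded in [0, v_max]."""
    n = len(returns)
    mean = sum(returns) / n
    # Hoeffding: |mean - E[G]| <= v_max * sqrt(log(2/delta) / (2n)) w.p. >= 1 - delta
    radius = v_max * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return mean - radius, mean + radius

# Toy usage: pretend these are returns from rollouts of a fixed policy.
random.seed(0)
samples = [random.uniform(0.0, 1.0) for _ in range(1000)]
lo, hi = hoeffding_value_bounds(samples, v_max=1.0)
```

Tighter intervals (e.g., empirical Bernstein) exploit the observed variance instead of only the range, which is part of what makes the choice of method interesting.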

          Abstract: "We give an overview of algorithms and results for a reinforcement learning setting where the learner does not have explicit information about the state space of the underlying Markov decision process (MDP). Instead, the learner has a set of models at her disposal that map histories (i.e., observations, chosen actions, and collected rewards) to states. Examples of reinforcement learning in such a setting include reinforcement learning in continuous or large state spaces, where models correspond to aggregations or discretisations of the underlying state space. From a practical point of view, methods such as context trees or probabilistic deterministic finite automata can be considered models in our sense."
      This talk is based on joint work with Odalric-Ambrym Maillard and Daniil Ryabko.
          Abstract: "Batch mode reinforcement learning is a subclass of reinforcement learning in which the decision-making problem has to be addressed using trajectories only (no model, no simulator, and no additional interactions with the actual system). In this setting, we propose a discussion of a minmax approach to generalization for deterministic problems with continuous state space. This approach aims at computing robust policies that account for the fact that the sample of trajectories may be arbitrarily bad. This discussion will be intertwined with the description of a fascinating batch mode reinforcement learning-type problem with trajectories of societies as input, and for which crucial good decisions have to be taken: the energy transition."
          Abstract: "Bayesian inference can behave badly if the model under consideration is wrong yet useful: the posterior may fail to concentrate even for large samples, leading to extreme overfitting in practice. We demonstrate this on a simple regression problem. The problem goes away if we make the so-called learning rate small enough, which essentially amounts to making the prior more and the data less important. Standard Bayes sets the learning rate to 1, which can be too high under model misspecification; in the exponential weights algorithm and PAC-Bayesian inference, close cousins of Bayes popular in the learning theory community, one often sets the learning rate to 1/sqrt{sample size}, which is too low if the setting is not adversarial. We introduce the safe Bayesian estimator*, which learns the learning rate from the data. It behaves essentially as well as standard Bayes if the model is correct but continues to achieve minimax optimal rates with wrong models."

      * For an informal introduction to the idea, see Larry Wasserman's blog <http://normaldeviate.wordpress.com/2012/06/26/self-repairing-bayesian-inference/> on 'self-repairing Bayesian inference'.
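To give a concrete feel for the learning-rate idea (a minimal sketch under our own toy setting, not the SafeBayes algorithm itself): a generalized-Bayes update over a finite model class raises the likelihood to the power eta, so eta = 1 recovers standard Bayes and smaller eta down-weights the data relative to the prior. Function names and numbers here are invented for illustration.

```python
import math

def tempered_posterior(log_likelihoods, log_prior, eta):
    """Generalized-Bayes update: posterior weights proportional to
    prior * exp(eta * log-likelihood). eta = 1 is standard Bayes;
    eta = 0 ignores the data and returns the prior."""
    logs = [lp + eta * ll for lp, ll in zip(log_prior, log_likelihoods)]
    m = max(logs)                      # log-sum-exp for numerical stability
    unnorm = [math.exp(x - m) for x in logs]
    z = sum(unnorm)
    return [w / z for w in unnorm]

# Two candidate models with a uniform prior; model 0 fits the data better.
log_prior = [math.log(0.5), math.log(0.5)]
post_bayes = tempered_posterior([-10.0, -12.0], log_prior, eta=1.0)
post_prior = tempered_posterior([-10.0, -12.0], log_prior, eta=0.0)
```

The open question the talk addresses is how to choose eta from the data so that the update stays safe under misspecification.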
          Abstract: "Predictive state representations (PSRs) offer an expressive framework for modelling partially observable systems. By compactly representing systems as functions of observable quantities, the PSR learning approach avoids using local-minima prone expectation-maximization and instead employs a globally optimal moment-based algorithm. Moreover, since PSRs do not require a predetermined latent state structure as an input, they offer an attractive framework for model-based reinforcement learning when agents must plan without a priori access to a system model. Unfortunately, the expressiveness of PSRs comes with significant computational cost, and this cost is a major factor inhibiting the use of PSRs in applications. In order to alleviate this shortcoming, we introduce the notion of compressed PSRs (CPSRs). The CPSR learning approach combines recent advancements in dimensionality reduction, incremental matrix decomposition, and compressed sensing. We show how this approach provides a principled avenue for learning accurate approximations of PSRs, drastically reducing the computational costs associated with learning while also providing effective regularization. Going further, we propose a planning framework which exploits these learned models. We present empirical results showing that this approach facilitates model-learning and planning in large complex partially observable domains."

      Industry Talks

          Abstract: "The main objective in the ad recommendation problem is to generate a strategy that selects which offer to present to a user who is visiting a website. Such a policy could be computed using reinforcement learning (RL) technology. An RL policy selects offers based on all the available knowledge about the user at the current visit, in order to maximize the total number of times she will click over multiple visits, also known as the user's life-time value (LTV). Ad recommendation policies are often built using a batch of data generated by previously deployed strategies of the company. Although there are many techniques for finding a policy given a batch of data, there are few results guaranteeing that the obtained policy will perform well in the real system without a chance to deploy/execute it. One approach for predicting a policy's performance is to execute it on a model or a simulator of the real world. But models are notoriously hard to learn. In this talk we will present an alternative model-free approach, which computes a lower confidence bound on the expected return of a policy."

          Abstract: "Modern business professionals need to master controls embedded in complex systems to achieve their financial goals. In the e-commerce industry, a natural example is that of paid search advertisement, where controls provided by search engines are simple to understand, but hard to master. From traffic segmentation via keywords (or, more recently, products), to traffic volume and performance management via bids and budgets, business professionals need to achieve high-level aggregate goals, such as return over ad-spend (ROAS), while managing fine-grained controls. In this talk, we share what we learned from building a bid-management platform for Adchemy (acquired by @WalmartLabs) that enabled business professionals to worry only about their goals, and leave the fine-grained controls to a learning-based system. Such a system can be cast as a decision-making and optimization problem in a complex, stochastic environment. This requires building an accurate model of the control-to-target-metric relationship. A natural approach is to partition the system into smaller components, and build models for each one of them. In our example, the environment can be partitioned into five components: impression volume, click-through rate, bid and cost-per-click relationship, conversion rate, and revenue per conversion. Such a partition reduces model complexity, and allows building better models for rare events such as user clicks and conversions; it also provides better interpretability and easier diagnosis of the models. Our models, however, typically suffer from misspecification. This is not a primary concern, as we developed a methodology to select models based not only on their goodness of fit, but mainly on the system's ability to manage the users' expectations on the target business metric. Our methodology can be summarized as follows:
      1. A bootstrapping based nonparametric test to find whether or not the combination of models - one for each component of paid-search environment - has predictive power with respect to the final metric of interest.
      2. An algorithm for finding a good combination of models from a candidate set of models for each component of paid-search environment, taking into account the final metric of interest.
      3. Expectation setting for the stakeholders by recommending how long to wait before the performance can be measured reliably."

      Contributed Talks

          Abstract: "We describe how to use robust Markov decision processes for value function approximation with state aggregation. The robustness serves to reduce the sensitivity to the approximation error of sub-optimal policies in comparison to classical methods such as fitted value iteration. This results in reducing the bounds on the γ-discounted infinite horizon performance loss by a factor of 1/(1 − γ) while preserving polynomial-time computational complexity. Our experimental results show that using the robust representation can significantly improve the solution quality with minimal additional computational cost."
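A toy sketch of the robust flavor involved (our illustration, not the paper's algorithm): value iteration where each backup takes the worst case over a finite set of candidate transition models, then the best action. The model structure and numbers below are invented.

```python
def robust_value_iteration(P_set, R, gamma=0.9, iters=200):
    """Robust value iteration over a finite uncertainty set of transition
    models. Each model P in P_set has shape P[a][s][s']; R[s] is the
    state reward. Each backup is max over actions of the min over models."""
    n_states = len(R)
    n_actions = len(P_set[0])
    V = [0.0] * n_states
    for _ in range(iters):
        V = [
            max(
                min(
                    R[s] + gamma * sum(P[a][s][t] * V[t] for t in range(n_states))
                    for P in P_set
                )
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
    return V

# Two candidate models of a 2-state chain with one action; the robust
# value is computed against whichever model is worse for the agent.
P1 = [[[0.9, 0.1], [0.1, 0.9]]]   # sticky dynamics
P2 = [[[0.5, 0.5], [0.5, 0.5]]]   # uniform dynamics
R = [1.0, 0.0]
V = robust_value_iteration([P1, P2], R, gamma=0.9)
```

The min-over-models step is what buys robustness: the resulting policy's value is guaranteed under every model in the uncertainty set, at the cost of conservatism.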
          Abstract: "Conditional Value at Risk (CVaR) is a prominent risk measure that is used extensively in various domains. We develop a new formula for the gradient of the CVaR in the form of a conditional expectation. Based on this formula, we propose a novel sampling-based estimator for the CVaR gradient, in the spirit of the likelihood-ratio method. We analyze the bias of the estimator, and prove the convergence of a corresponding stochastic gradient descent algorithm to a local CVaR optimum. Our method allows CVaR optimization to be considered in new domains. As an example, we consider a reinforcement learning application, and learn a risk-sensitive controller for the game of Tetris."
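To give a concrete feel for a likelihood-ratio-style CVaR gradient estimator (our toy sketch under an assumed Gaussian loss, not the authors' exact estimator): for a loss X ~ N(theta, 1), the score is (x - theta), and averaging score times excess-over-VaR across the alpha-tail estimates the CVaR gradient with respect to theta, which for this location family equals 1.

```python
import math
import random

def cvar_gradient_lr(theta, alpha=0.05, n=200000, seed=0):
    """Likelihood-ratio-style estimate of d/d(theta) CVaR_alpha of the
    loss X ~ N(theta, 1): conditional average of score(x) * (x - VaR)
    over the alpha upper tail. (Toy setting; names are ours.)"""
    rng = random.Random(seed)
    xs = sorted(rng.gauss(theta, 1.0) for _ in range(n))
    var_hat = xs[int(math.ceil((1.0 - alpha) * n)) - 1]   # empirical VaR
    tail = [x for x in xs if x >= var_hat]
    # score of N(theta, 1) at x is (x - theta)
    return sum((x - theta) * (x - var_hat) for x in tail) / len(tail)

g = cvar_gradient_lr(0.0, alpha=0.05)   # should be close to 1
```

With an estimator like this in hand, CVaR can be optimized by stochastic gradient descent, which is the route the talk takes toward the Tetris controller.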
          Abstract: "For Markov decision processes with long horizons (i.e., discount factors close to one), it is common in practice to use reduced horizons during planning to speed computation. However, perhaps surprisingly, when the model available to the agent is estimated from data, as will be the case in most real-world problems, the policy found using a shorter planning horizon can actually be better than a policy learned with the true horizon. The reason, we argue, is that the planning horizon is a complexity control parameter for the class of policies to be learned. We show formally that it has an intuitive, monotonic relationship with a counting measure of complexity. The measure gives rise to a bound on the planning loss, predicting that a planning horizon shorter than the true horizon can reduce overfitting and improve test performance, and we confirm these predictions empirically."
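A minimal sketch of the practice the abstract refers to (our toy example, not the paper's experiments): plan on an estimated model with a reduced discount factor, then read off the greedy policy. The model, rewards, and gamma values below are invented for illustration.

```python
def value_iteration(P, R, gamma, iters=300):
    """Plain value iteration on model P[a][s][s'] with state rewards R[s]."""
    n = len(R)
    V = [0.0] * n
    for _ in range(iters):
        V = [max(R[s] + gamma * sum(P[a][s][t] * V[t] for t in range(n))
                 for a in range(len(P))) for s in range(n)]
    return V

def greedy_policy(P, R, V, gamma):
    """Greedy policy with respect to V under model P."""
    n = len(R)
    return [max(range(len(P)),
                key=lambda a: R[s] + gamma * sum(P[a][s][t] * V[t] for t in range(n)))
            for s in range(n)]

# An *estimated*, possibly misspecified model; planning with a reduced
# discount gamma_plan shortens the effective horizon, acting as the
# complexity-control knob the abstract describes.
P_hat = [
    [[0.8, 0.2], [0.2, 0.8]],   # action 0: tends to stay put
    [[0.5, 0.5], [0.5, 0.5]],   # action 1: random move
]
R = [0.0, 1.0]
gamma_plan = 0.5                 # shorter than the evaluation horizon
V = value_iteration(P_hat, R, gamma_plan)
pi = greedy_policy(P_hat, R, V, gamma_plan)
```

The abstract's point is that when P_hat is noisy, the policy planned with gamma_plan below the true discount can evaluate better under the true horizon than the one planned with the true discount.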
          Abstract: "Bayesian methods suffer from the problem of how to specify prior beliefs. One interesting idea is to consider worst-case priors. This requires solving a stochastic zero-sum game. In this paper, we extend well-known results from bandit theory in order to discover minimax-Bayes policies and discuss when they are practical."
         Abstract: "We consider the problems of learning and planning in Markov decision processes with temporally extended actions represented in the options framework. We propose to use predictions about the duration of extended actions to represent the state and show how this leads to a compact predictive state representation model independent of the set of primitive actions. Then we develop a consistent and efficient spectral learning algorithm for such models."