# Credit Assignment in DL and DRL

## Credit assignment in Deep Learning and Deep Reinforcement Learning Workshop

## ICML 2018

## Saturday, July 14 - Sunday, July 15, 2018

**Stockholm, Sweden**

## Call for papers - https://openreview.net/group?id=ICML.cc/2018/ECA

### Deadline: 15 June, 2018

We accept both short paper (4 pages) and long paper (8 pages) submissions. A few papers may be selected for oral presentation, and the other accepted papers will be presented in a poster session. There will be no proceedings for this workshop; however, upon the author’s request, accepted contributions will be made available on the workshop website. Submission is single-blind and open to already published work.

We are interested in submissions that deal with the issue of credit assignment: how the parameters, actions, or states of a system can be changed to produce some downstream effect. The work can be experimental, analytical, or theoretical. We are also open to work in progress. Synthetic Gradients, Sparse Attentive Backtracking, Equilibrium Propagation and UORO are examples of the kinds of work that we'd consider to be highly relevant.

We welcome submissions related to the following topics:

- Alternatives to Backpropagation for training deep networks
- New ways of assigning credit to actions in reinforcement learning (e.g. temporal difference learning, eligibility traces)
- Biologically plausible methods for learning
- Exploration of the properties of credit assignment through gradient descent.

We also identify several pieces of previous work related to credit assignment:

- Synthetic Gradients
- Sparse Attentive Backtracking
- Equilibrium Propagation
- Unbiased Online Recurrent Optimization (UORO)
- Evolution Strategies

## Workshop Description

Deep Learning has enabled massive improvements in areas as diverse as computer vision, text understanding, and reinforcement learning. A key driver of this progress has been the backpropagation algorithm. Credit assignment is ultimately about how changing the model’s behavior could have led to improved outcomes.

Credit assignment is a critical part of Deep Learning and Reinforcement Learning, yet the field has spent surprisingly little effort thinking about its different flavors. No complete solutions exist, and the partial solutions that do exist fall well short of the way humans solve credit assignment problems. While backpropagation has been highly successful, it has profound limitations. One is that the time to compute estimates of the gradient does not scale well with the size of the computational graph. Another is that there is no way to perform online parameter estimation without truncation (as in truncated backpropagation through time, BPTT), which leads to biased estimates of the gradient.
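The bias introduced by truncation can be seen on a toy problem. The sketch below is an illustration only, under assumptions that are not part of the workshop text: a scalar linear RNN `h_t = w * h_{t-1} + x_t`, a loss equal to the final state `h_T`, and hypothetical helper names `forward` and `grad_wrt_w`. It compares the exact gradient with the one obtained when backpropagation is cut off after the last `k` steps.

```python
def forward(w, xs):
    """Run the scalar linear RNN h_t = w * h_{t-1} + x_t and return [h_0, ..., h_T]."""
    hs = [0.0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs


def grad_wrt_w(w, xs, k=None):
    """d h_T / d w by backpropagation through time, keeping only the last k steps.

    k=None backpropagates through the whole sequence (the exact gradient).
    """
    hs = forward(w, xs)
    T = len(xs)
    k = T if k is None else min(k, T)
    # Exact gradient: sum over t = 1..T of w**(T - t) * h_{t-1}.
    # Truncation keeps only the k most recent terms and drops the older ones.
    return sum(w ** (T - t) * hs[t - 1] for t in range(T, T - k, -1))


w, xs = 0.9, [1.0] * 10
full = grad_wrt_w(w, xs)
truncated = grad_wrt_w(w, xs, k=2)
print(f"full: {full:.4f}  truncated (k=2): {truncated:.4f}")
```

In this toy setting every dropped term is positive, so the truncated gradient systematically underestimates the exact one; the gap is a bias, not noise that averages out.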

This research problem, while in its infancy, has already seen significant contributions. For example, the Synthetic Gradient algorithm (Jaderberg 2016) aims to modify the training procedure to allow for decoupled updates. The “Unbiased Online Recurrent Optimization” algorithm (Tallec 2016) showed that RNNs can be learned in an online fashion by using a low-rank approximation to forward-mode automatic differentiation. The Sparse Attentive Backtracking algorithm (Ke 2017) modifies the backpropagation algorithm to be efficient for long sequences by using a hard attention mechanism to selectively backtrack through a small number of salient time steps in the past. We believe that a workshop would be a great place to allow these contributions and more novel ideas to shine.

Even coming up with better formal definitions for the credit assignment problem would be an important potential outcome of this workshop. Additionally, the relationship between approaches from the Deep Learning and Reinforcement Learning communities is not well understood. As such, it seems that rethinking credit assignment is underappreciated, under-researched, and promises to improve algorithms considerably. Rethinking credit assignment will require the involvement of many communities spanning cognitive science, computational complexity, and deep learning.

## Speakers:

- Martha White, University of Alberta
- Doina Precup, McGill University
- David Silver, DeepMind, University College London
- David Duvenaud, University of Toronto
- Blake Richards, University of Toronto
- Timothy Lillicrap, DeepMind, University College London
- Sepp Hochreiter, Johannes Kepler University Linz
- Claudia Clopath, Imperial College, London
- Jurgen Schmidhuber, IDSIA
- Theophane Weber, DeepMind
- Joel Veness, DeepMind
- Jane Wang, DeepMind

# Workshop Schedule

## July 14, 2018

14:00 - 14:30 - Doina Precup

14:30 - 15:00 - Matti Herranen

15:00 - 15:30 - Joel Veness

15:30 - 16:00 - Coffee break

16:00 - 16:30 - Sepp Hochreiter

16:30 - 17:15 - **Contributed Talks**

- Timothy Lillicrap, **Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures**
- Benjamin Scellier, **Equivalence of Equilibrium Propagation and Recurrent Backpropagation**
- Tim Cooijman, **REINFORCE as Approximate Real-Time Recurrent Learning**

## July 15, 2018

8:45 - 9:00 - Opening Remarks

9:00 - 9:30 - David Silver

9:30 - 10:00 - Martha White

10:00 - 10:30 - Coffee Break/Posters

10:30 - 11:00 - Theophane Weber

11:00 - 11:30 - Claudia Clopath

11:30 - 12:00 - David Duvenaud

12:00 - 12:30 - Blake Richards

12:30 - 14:00 - Lunch Break

14:00 - 14:30 - Timothy Lillicrap

14:30 - 15:00 - **Contributed Talks**

- Martin Klissarov, **Diffusion-Based Approximate Value Functions**
- Asier Mujika, **Approximating Real-Time Recurrent Learning with Random Kronecker Factors**

15:00 - 15:30 - Jurgen Schmidhuber

15:30 - 16:00 - Coffee Break

16:00 - 17:30 - **Panel Discussion with Yoshua Bengio, Timothy Lillicrap, Blake Richards, Jane Wang**

## Abstracts for Talks

### Knowledge representation for efficient credit assignment in reinforcement learning (Doina Precup)

In this talk, I will argue that in addition to the mechanics of temporal credit assignment algorithms, it is important to focus on the way in which we represent knowledge in order to make credit assignment possible. I will underline the role of temporal abstraction mechanisms in this process.

### Deep Learning, Reinforcement Learning, and the Credit Assignment Problem (David Silver)

A major issue in machine learning is how to assign credit for an outcome over a sequence of steps that led to the outcome. In reinforcement learning, the issue is how to assign credit over a sequence of actions leading to cumulative reward. In deep learning, the issue is how to assign credit over a sequence of activations leading to a loss. Temporal difference (TD) learning is a solution method for the credit assignment problem that can be applied in both cases. The first part of the talk will focus on reinforcement learning: specifically, on how to learn the meta-parameters of TD learning. The second part of the talk will focus on deep learning: specifically, how to use TD learning with synthetic gradients as a principled alternative to error backpropagation.
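For readers unfamiliar with TD learning as a credit assignment mechanism, here is a minimal textbook-style sketch of tabular TD(λ) with accumulating eligibility traces. It is not taken from the talk: the deterministic chain environment, the function name `td_lambda`, and all hyperparameter values are assumptions chosen for illustration.

```python
def td_lambda(n_states=3, episodes=100, alpha=0.5, gamma=1.0, lam=0.9):
    """Tabular TD(lambda) with accumulating traces on a deterministic chain.

    The agent walks 0 -> 1 -> ... -> n_states - 1 -> terminal and receives a
    single reward of 1 on the final transition, so every state's true value is 1
    when gamma = 1.
    """
    values = [0.0] * n_states
    for _ in range(episodes):
        traces = [0.0] * n_states  # eligibility traces reset at episode start
        for s in range(n_states):
            last = s == n_states - 1
            reward = 1.0 if last else 0.0
            next_value = 0.0 if last else values[s + 1]
            td_error = reward + gamma * next_value - values[s]
            traces[s] += 1.0  # mark the current state as eligible
            for i in range(n_states):
                # Every recently visited state shares in the TD error.
                values[i] += alpha * td_error * traces[i]
                traces[i] *= gamma * lam  # older visits receive less credit
    return values


values = td_lambda()
print(values)
```

With `lam=0.0` a single episode updates only the last state, whereas a larger `lam` lets the delayed reward reach earlier states within the same episode, which is the sense in which the trace parameter controls how far credit is assigned back in time.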

### Beyond Backprop (Jurgen Schmidhuber)

“Modern” backpropagation (Linnainmaa, 1970) is now widely used. Many tasks, however, cannot be solved by backprop. I give examples where credit assignment can be achieved or greatly improved through other methods such as artificial evolution, compressed network search, universal search, the Optimal Ordered Problem Solver, and meta-learning.

### Backward View and Reward Redistribution for Delayed Rewards (Sepp Hochreiter)

Most reinforcement learning approaches rely on a forward view to predict the expected return. Examples are Monte Carlo and temporal difference methods like SARSA or Q-learning, which estimate the state-value or action-value function, policy gradients using value or advantage functions, and Monte Carlo Tree Search. However, the forward view faces problems with probabilistic environments and with high branching factors of the state transitions, since it has to average over all possible futures. These problems become more severe for delayed rewards. The number of paths to the reward grows exponentially with the delay steps; the reward information must be propagated further back; averaging becomes more difficult; the variance of many values of state-action pairs is increased. We suggest avoiding these problems by a backward view where episodes that have been observed are analyzed. We avoid probabilities and guessing about possible futures, while identifying key events and important states that led to a reward. The backward view allows for a reward redistribution which largely reduces the delays of the rewards while the expected return of a policy is not changed. The optimal reward redistribution via a return decomposition gives immediate feedback to the agent about each executed action. If the expectation of the return increases, then a positive reward is given, and if the expectation of the return decreases, then a negative reward is given. We introduce RUDDER, a return decomposition method, which creates a new MDP with the same optimal policies as the original MDP but with redistributed rewards that have largely reduced delays. If the return decomposition is optimal, then the new MDP does not have delayed rewards and TD estimates are unbiased. In this case, the rewards track Q-values so that the future expected reward is always zero. On artificial tasks with different lengths of reward delays, we show that RUDDER is exponentially faster than TD, MC, and MC Tree Search (MCTS).
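The core idea of return-preserving reward redistribution can be illustrated with a toy calculation. The sketch below is not the RUDDER implementation: it assumes a hypothetical list `predicted_returns` (a model's estimate of the final return after each step of one observed episode), and the function name `redistribute` is made up. Each step is rewarded with the change in the predicted return, and a final correction keeps the episode return unchanged.

```python
def redistribute(rewards, predicted_returns):
    """Spread an episode's return over its steps using return-prediction differences.

    rewards:            original per-step rewards of one observed episode
    predicted_returns:  hypothetical estimates of the final return after each step
    """
    actual_return = sum(rewards)
    new_rewards = []
    previous = 0.0
    for g in predicted_returns:
        # A step is rewarded by how much it changed the expected return.
        new_rewards.append(g - previous)
        previous = g
    # Correct the last step so the redistributed rewards sum to the true return.
    new_rewards[-1] += actual_return - previous
    return new_rewards


rewards = [0.0, 0.0, 0.0, 0.0, 1.0]            # reward arrives only at the end
predicted_returns = [0.1, 0.1, 0.8, 0.9, 1.0]  # assumed model output
new_rewards = redistribute(rewards, predicted_returns)
print(new_rewards)
```

In this made-up episode most of the reward moves to the third step, where the predicted return jumps, while the total return stays the same.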

### Learning to attend, attending to learn: Modulating auxiliary unsupervised costs with attention (Matti Herranen)

Augmenting a primary task with an unsupervised task is common practice in, for instance, reinforcement learning and semi-supervised deep learning. However, if a lot of the structure that the unsupervised task is trained on is irrelevant for the primary task, the unsupervised task might not support the primary task. We propose to use the gradient of the output of the primary task to derive an attention signal which modulates the cost function used for the auxiliary unsupervised task. This is applicable in cases where the unsupervised cost is applied at a lower level of the network. The proposed modulation, or attention, is shown to significantly improve semi-supervised learning with the Ladder networks in two datasets with ample irrelevant structure for the primary task.

### Self-tuning Gradient Estimators through Differentiable Surrogates (David Duvenaud)

We show how to learn low-variance, unbiased gradient estimators for any function of random variables. When applied to reinforcement learning, this approach gives a generalization of Advantage Actor-Critic which is pseudo-action-dependent and has more stable training dynamics. Our approach is based on gradients of a neural net surrogate to the original function, tuned during training to minimize the variance of its gradients.

### RL in the brain: The role of neuromodulation in learning (Claudia Clopath)

### An RNN Architecture using Value Functions (Martha White)

Effectively training RNNs remains an important open problem, where the typical strategy is to use truncated backpropagation through time (BPTT), which is sensitive to truncation. In this talk, I will highlight that an alternative RNN architecture, composed of value function predictions about the future, is significantly easier to train without using BPTT. Further, using eligibility trace methods for training these value functions can significantly improve learning speed empirically, suggesting this architecture is one strategy for benefiting from credit assignment strategies in RL for training of RNNs.

### Improved Credit Assignment in Stochastic Computation Graphs (Theophane Weber)

Stochastic Computation Graphs provide a common formalism to represent the computation arising from a variety of models from supervised, unsupervised, and reinforcement learning. In particular, an unbiased estimator of the gradient of the expected loss of such models can be derived from a single principle. While unbiased, this estimator often has high variance, especially in the cases where reparametrization is impossible. In this work, we detail alternative estimators for the gradient by borrowing ideas from the reinforcement learning literature such as baselines and critics, and show how they generalize known results in the literature.
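The effect of a baseline on a score-function gradient estimator can be checked exactly on a one-variable example. The sketch below is an illustration under assumptions not taken from the talk: a single Bernoulli variable with success probability `sigmoid(theta)`, a made-up loss `f(x) = 10 + x`, and a hypothetical helper `estimator_stats` that enumerates both outcomes instead of sampling.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def estimator_stats(theta, f, baseline):
    """Exact mean and variance of the estimator (f(x) - baseline) * d/dtheta log p(x).

    x is Bernoulli with p = sigmoid(theta), so d/dtheta log p(x) = x - p.
    """
    p = sigmoid(theta)
    outcomes = [(1, p), (0, 1.0 - p)]  # (value of x, probability of x)
    samples = [((f(x) - baseline) * (x - p), prob) for x, prob in outcomes]
    mean = sum(g * prob for g, prob in samples)
    variance = sum(prob * (g - mean) ** 2 for g, prob in samples)
    return mean, variance


def f(x):
    return 10.0 + x  # assumed loss with a large constant offset


theta = 0.0
p = sigmoid(theta)
expected_loss = p * f(1) + (1.0 - p) * f(0)

mean_plain, var_plain = estimator_stats(theta, f, baseline=0.0)
mean_base, var_base = estimator_stats(theta, f, baseline=expected_loss)
print(mean_plain, var_plain)
print(mean_base, var_base)
```

Both estimators have the same mean, because the score `x - p` has zero expectation and a constant baseline therefore adds no bias; only the variance changes.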

### Towards a gradient agnostic metric of efficient credit assignment (Blake Richards)

Efficient credit assignment is a concept that is critical to deep learning, but which has yet to receive a formal definition. As such, whether or not a given learning algorithm achieves efficient credit assignment is often determined by comparing it to gradient descent. According to this practical metric, gradient descent is the effective definition of efficient credit assignment. However, depending on the learning goals, there can be instances where efficient credit assignment need not be understood via gradient descent, including the use of long-term memory and meta-learning. Here, I propose that efficient credit assignment is potentially best understood as an optimal control problem, where the system to be controlled is the evolution of the learning agent over parameter updates. I propose that efficient credit assignment could potentially be formalized by examining the extent to which the entire state space of the agent's learning dynamics can be controlled by external data and/or teaching signals.

## Organizers

- **Anirudh Goyal (MILA, University of Montreal)**
- **Alex Lamb (MILA, University of Montreal)**
- **Nan Rosemary Ke (MILA, University of Montreal)**
- **Jonathan Binas (MILA, University of Montreal)**
- **Aaron Courville (MILA, University of Montreal)**
- **Konrad Kording (UPenn)**
- **Yoshua Bengio (MILA, University of Montreal)**