Chentanez, Nuttapong, Andrew G. Barto, and Satinder P. Singh. "Intrinsically motivated reinforcement learning." Advances in neural information processing systems. 2005.
The environment is factored into an external environment and an internal environment, the latter of which contains the critic that determines primary reward. Rewards are simply stimuli transduced by the internal environment so as to generate the appropriate level of primary reward.
J. Achiam and S. Sastry. Surprise-based intrinsic motivation for deep reinforcement learning. arXiv:1703.01732, 2017.
[RL] A reinforcement learning agent uses experiences obtained from interacting with an unknown environment to learn behavior that maximizes a reward signal. The optimality of the learned behavior is strongly dependent on how the agent approaches the exploration/exploitation trade-off in that environment. If it explores poorly or too little, it may never find rewards from which to learn, and its behavior will always remain suboptimal; if it does find rewards but exploits them too intensely, it may wind up prematurely converging to suboptimal behaviors, and fail to discover more rewarding opportunities.
[Existing] Heuristic exploration strategies: epsilon-greedy action selection or Gaussian control noise -> inadequate when rewards are especially sparse. The failure modes in all of these cases pertained to the nature of the exploration: the agents encountered reward signals so infrequently that they were never able to learn reward-seeking behavior.
[Intrinsic] more complex heuristics: efficient and scalable exploration strategies that maximize a notion of an agent’s surprise about its experiences via intrinsic motivation.
One approach to encourage better exploration is via intrinsic motivation, where an agent has a task-independent, often information-theoretic intrinsic reward function which it seeks to maximize in addition to the reward from the environment. Examples of intrinsic motivation include empowerment, where the agent enjoys the level of control it has over its future; surprise, where the agent is excited to see outcomes that run contrary to its understanding of the world; and novelty, where the agent is excited to see new states.
In this work, we build on that success by exploring scalable measures of surprise for intrinsic motivation in deep reinforcement learning.
We formulate surprise as the KL-divergence of the true transition probability distribution from a transition model which is learned concurrently with the policy, and consider two approximations to this divergence which are easy to compute in practice.
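As a sketch of that formulation (the notation here is assumed for illustration, not quoted from the paper): with true dynamics P and learned model P_phi, the surprise at a transition and the two tractable approximations used as intrinsic rewards can be written as

```latex
% Surprise as the KL divergence of the true dynamics P from the learned model P_phi
\mathrm{surprise}(s_t, a_t) = D_{\mathrm{KL}}\big(P(\cdot \mid s_t, a_t)\,\|\,P_{\phi}(\cdot \mid s_t, a_t)\big)

% Approximation 1 (surprisal): negative log-likelihood of the observed next state
r^{\mathrm{int}}_t \approx -\log P_{\phi}(s_{t+1} \mid s_t, a_t)

% Approximation 2 (k-step learning progress): improvement of the model over k updates
r^{\mathrm{int}}_t \approx \log P_{\phi_t}(s_{t+1} \mid s_t, a_t) - \log P_{\phi_{t-k}}(s_{t+1} \mid s_t, a_t)
```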
[Contributions]
1. We investigate surprisal and learning progress as intrinsic rewards across a wide range of environments in the deep reinforcement learning setting, and demonstrate empirically that these incentives (especially surprisal) result in efficient exploration.
2. We evaluate the difficulty of the slate of sparse-reward continuous control tasks introduced by Houthooft et al. [7] to benchmark exploration incentives, and introduce a new task to complement the slate.
3. We present an efficient method for learning the dynamics model (transition probabilities) concurrently with a policy.
We distinguish our work from prior work in a number of implementation details: unlike Bellemare et al. [2], we learn a transition model as opposed to a state-action occupancy density; unlike Stadie et al. [22], our formulation naturally encompasses environments with stochastic dynamics; unlike Houthooft et al. [7], we avoid the overhead of maintaining a distribution over possible dynamics models, and learn a single deep dynamics model.
Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang. "Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, June 2019. (Oral; acceptance rate ~5%)
Trained with the intrinsic reward from the matching critic and the extrinsic reward from the environment, the reasoning navigator learns to ground the natural language instruction in both the local spatial visual scene and the global temporal visual trajectory.
Burda, Yuri, et al. "Large-scale study of curiosity-driven learning." arXiv preprint arXiv:1808.04355 (2018).
[RL] Reinforcement learning algorithms rely on **carefully engineered** environment rewards that are extrinsic to the agent. However, annotating each environment with hand-designed, dense rewards is not scalable, motivating the need for developing reward functions that are intrinsic to the agent. In RL, the agent policy is trained by maximizing a reward function that is designed to align with the task. Most of the success in RL has been achieved when this reward function is **dense and well-shaped**, e.g., a running "score" in a video game. These rewards are extrinsic to the agent and specific to the environment they are defined for.
[Challenge] However, designing a well-shaped reward function is a notoriously challenging engineering problem.
An alternative to “shaping” an extrinsic reward is to supplement it with dense intrinsic rewards [26], that is, rewards that are generated by the agent itself.
"Curiosity": which uses prediction error as reward signal
"Visitation Counts": which discourage the agent from revisiting the same states.
[Curiosity] Curiosity is a type of intrinsic reward function which uses prediction error as reward signal.
The idea is that these intrinsic rewards will bridge the gaps between sparse extrinsic rewards by guiding the agent to efficiently explore the environment to find the next extrinsic reward. We want to incentivize this agent with a reward r_t relating to how informative the transition was.
An agent trained to maximize this reward will favor transitions with high prediction error, which will be higher in areas where the agent has spent less time, or in areas with complex dynamics.
[Modules] In particular, we choose the dynamics-based curiosity model of intrinsic reward. The central idea is to represent intrinsic reward as the error in predicting the consequence of the agent's action given its current state, i.e., the prediction error of the agent's learned forward dynamics.
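A minimal sketch of this idea in PyTorch (module shapes, the feature dimension, and the scaling coefficient `eta` are illustrative assumptions, not the authors' exact architecture):

```python
import torch
import torch.nn as nn

class ForwardDynamicsCuriosity(nn.Module):
    """Dynamics-based curiosity: intrinsic reward = error in predicting
    the (embedded) next state from the current state and action."""

    def __init__(self, obs_dim, act_dim, feat_dim=64, eta=0.1):
        super().__init__()
        self.eta = eta  # scaling of the intrinsic reward (assumed hyperparameter)
        # Feature encoder phi(s); could be learned self-supervised or kept fixed
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))
        # Forward model f(phi(s), a) -> predicted phi(s')
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + act_dim, feat_dim),
                                           nn.ReLU(),
                                           nn.Linear(feat_dim, feat_dim))

    def intrinsic_reward(self, obs, action, next_obs):
        # Target features of s' are treated as fixed for the bonus computation
        with torch.no_grad():
            phi_next = self.encoder(next_obs)
        phi = self.encoder(obs)
        pred_phi_next = self.forward_model(torch.cat([phi, action], dim=-1))
        # Per-transition prediction error, used as the curiosity bonus
        return self.eta * 0.5 * (pred_phi_next - phi_next).pow(2).mean(dim=-1)
```

The same squared error would typically also serve as the training loss for the forward model, so the bonus shrinks in regions the model has already learned, exactly the behavior described above.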
Burda, Yuri, et al. "Exploration by Random Network Distillation." arXiv preprint arXiv:1810.12894 (2018).
As pointed out by many authors, agents that maximize such prediction errors tend to get attracted to transitions where the answer to the prediction problem is a stochastic function of the inputs. For example, if the prediction problem is that of predicting the next observation given the current observation and the agent's action (forward dynamics), an agent trying to maximize this prediction error will tend to seek out stochastic transitions, like those involving randomly changing static noise on a TV, or outcomes of random events such as coin tosses. This observation motivated the use of methods that quantify the relative improvement of the prediction, rather than its absolute error. Unfortunately, as previously mentioned, such methods are hard to implement efficiently.
RND incentivizes visiting unfamiliar states by measuring how hard it is to predict the output of a fixed random neural network on visited states. In unfamiliar states it’s hard to guess the output, and hence the reward is high. It can be applied to any reinforcement learning algorithm, is simple to implement and efficient to scale.
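A minimal sketch of the RND bonus (network sizes and the output dimension are illustrative assumptions):

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Random Network Distillation: a predictor network is trained to match a
    fixed, randomly initialized target network; the prediction error on a state
    is used as the exploration bonus (high on unfamiliar states)."""

    def __init__(self, obs_dim, out_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, out_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, out_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)  # the target network stays fixed forever

    def bonus(self, obs):
        # Intrinsic reward: squared error between predictor and frozen target
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        return (pred_feat - target_feat).pow(2).mean(dim=-1)
```

The predictor is trained on this same error over visited states, so the bonus decays as states become familiar; because only the state (not a stochastic next observation) enters the prediction problem, the target is a deterministic function of the input, sidestepping the noisy-TV issue above.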
Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation
Learning goal-directed behavior in environments with sparse feedback is a major challenge for reinforcement learning algorithms. One of the key difficulties is insufficient exploration, resulting in an agent being unable to learn robust policies. Intrinsically motivated agents can explore new behavior for their own sake rather than to directly solve external goals. Such intrinsic behaviors could eventually help the agent solve tasks posed by the environment.
Yu, Yang. "Towards Sample Efficient Reinforcement Learning." IJCAI 2018. [paper]
In particular, decision making for a long-term goal requires long-term vision and less greedy behavior.
By reinforcement learning, an agent interacts with the environment, explores the unknown area, and learns a policy from the exploration data. In a common setting, the exploration data contains environment state transitions associated with the exploration actions and reward signals.
From the data, the quality of the policy can be evaluated by the reward. Reinforcement learning algorithms update the policy model from these evaluations, with the aim of maximizing the total reward.
From the perspective of policy modeling, these algorithms can be categorized as value function estimation algorithms and policy search algorithms.
The former estimate a value function to approximate the long-term reward from the current state and action; the policy is then derived from the value function straightforwardly. The latter learn the policy model directly. Recent algorithms focus more on learning policy models with the help of value functions, known as actor-critic approaches, inheriting the merits of both.
A noticeable limitation of current reinforcement learning techniques is low sample efficiency, which requires a huge number of interactions with the environment.
In an unknown environment, the agent needs to visit states that have not been visited in order to collect better trajectory data. The agent cannot follow its current policy too tightly, since that policy has been learned from previous data and may only lead it along previously visited paths.
Exploration strategies are usually employed to encourage veering off the previous paths. Basic exploration methods such as e-greedy and Gibbs sampling inject some randomness into the output actions, i.e., action-space noise, so that the probability of executing every action, and thus visiting every state, is non-zero. A limitation of action-space noise is that the resulting policy (i.e., the latent policy corresponding to the randomized output) may be far away from the current policy in parameter space, or even outside the parameter space, which makes the policy update difficult.
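As a concrete illustration of these two action-space noise strategies (the Q-value array and the temperature are assumed placeholders for whatever value estimates the learner maintains):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a uniformly random action, otherwise the
    greedy one; every action keeps a non-zero selection probability."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def gibbs_sampling(q_values, temperature=1.0):
    """Gibbs (Boltzmann) exploration: sample an action with probability
    proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))
```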
Curiosity-driven exploration. All of the above exploration strategies are generally applicable; however, they are all blind searches. The agent may repeatedly try a bad path many times, since it does not know whether the path has been explored before. This might be a major reason that current general reinforcement learning algorithms require so many samples: they find a good path by luck. Curiosity-driven exploration [Singh et al., 2004] can be much more efficient than random exploration. The agent records the counts of visits to every state and action. According to the counts, an intrinsic reward is added to the environment reward to encourage visiting states that are less visited. This kind of approach was addressed a decade ago, when the state space and action space were small and discrete. For a high-dimensional state space, an obstacle to implementing curiosity-driven exploration is that it is hard to tell whether a state has been visited before. Recently, Pathak et al. [2017] proposed the Intrinsic Curiosity Module (ICM) to overcome this obstacle. It employs the state prediction error as a measure to determine whether a state has been visited. Meanwhile, it employs self-supervision to learn a low-dimensional representation of the states. Intrinsic reward, however, is a delayed feedback signal for driving the agent. Mechanisms that directly encourage exploration might be desired.
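A minimal sketch of the count-based bonus described above, for small discrete state-action spaces (the bonus form beta / sqrt(count) is a common choice, assumed here for illustration rather than taken from the cited work):

```python
from collections import defaultdict
import math

class CountBasedBonus:
    """Count-based curiosity: keep visit counts for (state, action) pairs and
    add an intrinsic reward that is large for rarely visited pairs."""

    def __init__(self, beta=0.1):
        self.beta = beta                # bonus scale (assumed hyperparameter)
        self.counts = defaultdict(int)  # N(s, a)

    def intrinsic_reward(self, state, action):
        self.counts[(state, action)] += 1
        return self.beta / math.sqrt(self.counts[(state, action)])

# Usage: the reward fed to the learner is the environment reward plus the bonus
# bonus = CountBasedBonus(beta=0.1)
# r_total = r_env + bonus.intrinsic_reward(s, a)
```

For high-dimensional observations, the explicit counts are replaced by a learned density model or, as with ICM above, by a prediction-error proxy for novelty.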