Ecological Reinforcement Learning

John D. Co-Reyes* Suvansh Sanjeev* Glen Berseth Abhishek Gupta Sergey Levine

UC Berkeley

* Equal Contribution


Abstract

Much of the current work on reinforcement learning studies episodic settings, where the agent is reset between trials to an initial state distribution, often with well-shaped reward functions. Non-episodic settings, where the agent must learn through continuous interaction with the world without resets, and where it receives only delayed and sparse reward signals, are substantially more difficult, but arguably more realistic, since real-world environments do not present the learner with a convenient "reset mechanism" and easy reward shaping. In this paper, instead of studying algorithmic improvements that can address such non-episodic and sparse reward settings, we study the kinds of environment properties that can make learning under such conditions easier. Understanding how properties of the environment impact the performance of reinforcement learning agents can help us structure our tasks in ways that make learning tractable. We first discuss what we term "environment shaping" -- modifications to the environment that provide an alternative to reward shaping, and may be easier to implement. We then discuss an even simpler property that we refer to as "dynamism," which describes the degree to which the environment changes independently of the agent's actions and can be measured by environment transition entropy. Surprisingly, we find that even this property can substantially alleviate the challenges associated with non-episodic RL in sparse reward settings. We provide an empirical evaluation on a set of new tasks focused on non-episodic learning with sparse rewards. Through this study, we hope to shift the focus of the community towards analyzing how properties of the environment can affect learning and the ultimate type of behavior that is learned via RL.

Motivation

Reinforcement learning is normally studied in the episodic setting, where the agent is reset at the start of each episode. This makes learning easier, but in the real world we would like our agents to learn continually with minimal human supervision and without having to be manually reset each time they make a mistake. Reset-free or non-episodic learning is difficult, especially with sparse rewards, where the agent may never experience any rewarding states and therefore make no progress. Even without any algorithmic changes, however, certain properties of the environment can make learning without resets and with sparse rewards more tractable. We investigate and analyze two such properties: environment shaping and environment dynamism.

Environmental Properties:

  • Environment Shaping alters the initial state or dynamics of the non-episodic training MDP to make learning more tractable than in an unshaped environment. For example, if the agent is tasked with eating apples, a shaped environment may initially contain an abundance of easily obtainable apples, which allows the agent to learn that apples are rewarding. As the easily obtainable apples are consumed, the agent must eventually learn to reach apples that are farther away and take more steps to obtain, such as by climbing a ladder up a tree. A shaped environment can be thought of as a natural curriculum for the non-episodic setting.
  • Environment Dynamism refers to the entropy of the MDP's transitions independent of the agent's actions; it provides a soft uniform reset mechanism for the agent, helping it reach a wider variety of states in the non-episodic setting. A static environment corresponds to a very controlled setting in which no other entity or part of the environment changes on its own, while a dynamic environment is the opposite. Dynamic environments can be found readily in the real world (humans and other agents provide natural dynamism), so we may only need to deploy our agents in these existing settings. A minimal sketch of such agent-independent dynamics is shown below.
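
To make the notion of dynamism concrete, here is a minimal, hypothetical sketch (not the paper's actual environment code) of a gridworld step function in which a deer moves on its own with some probability each timestep, independent of the agent's action; these agent-independent transitions are what add entropy to the environment dynamics and act as a soft reset.

```python
import random

# Minimal sketch of agent-independent dynamics in a gridworld. The grid size,
# move set, and deer movement probability below are illustrative assumptions,
# not the values or code used in the paper.
GRID_SIZE = 8
DEER_MOVE_PROB = 0.2          # probability the deer moves on its own each step
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def _clamp(v):
    return max(0, min(GRID_SIZE - 1, v))

def step(agent_pos, deer_pos, action):
    """One environment step: the agent moves according to its action, while the
    deer moves independently of the agent with probability DEER_MOVE_PROB.
    In a static environment DEER_MOVE_PROB would be 0."""
    dx, dy = MOVES[action]
    agent_pos = (_clamp(agent_pos[0] + dx), _clamp(agent_pos[1] + dy))
    if random.random() < DEER_MOVE_PROB:
        mx, my = random.choice(MOVES)
        deer_pos = (_clamp(deer_pos[0] + mx), _clamp(deer_pos[1] + my))
    return agent_pos, deer_pos
```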

Experiments

We build environments and tasks to investigate learning in the non-episodic, sparse-reward setting and to study the effect of these properties on learning performance.

Salad-Making Task

In this task, the agent must acquire and combine two vegetables to make a salad.

Hunting Task

In this task, the agent must hunt the deer with an axe. Only the axe can be picked up by the agent.

Factory Task

In this task, the agent must collect materials from moving workers and combine them to create an axe.

Scavenging Task

In this task, the agent must collect food for sustenance while evading predators that damage its health.

Unity Food Collector

In this task taken from Unity's ML Agents package, the agent must collect green, healthy food, while avoiding red, poisonous food.

Reset-Free Learning in Dynamic Environments

In this section, we examine the behavior of agents trained using a dense distance-based reward on the tool-making, hunting, and Unity food collector tasks, in both static and dynamic settings, where dynamism refers to changes in the environment that do not result from the agent's actions. We find that, to a large extent, dynamic environments alleviate the challenges associated with non-episodic learning. The lesson we might draw is that, although individual properties of natural environments (such as the absence of resets) can make the learning process harder, combining these properties (i.e., the non-episodic dynamic setting) can actually alleviate these challenges: the dynamics of the environment naturally cause the agent to experience a variety of different situations, even before it has learned to take meaningful and coordinated actions. The behavior shown below is on validation tasks, which are drawn from the same distribution for the agents trained in both the static and dynamic settings.

Hunting Task

Behavior comparison: Below, we compare the state visitation counts during training for agents trained in the dynamic and static environments, both in the non-episodic setting. We find that the dynamic environment results in a more balanced state visitation distribution in later epochs, with the center area of the grid showing greater visitation values than in the static environment.

Static Environment: These gifs show the performance on the validation task of an agent trained in a static environment in a non-episodic setting, wherein resources are regenerated only when they are depleted, and deer do not move. The reward used is distance-based, wherein the reward provided is the negative scaled L1 distance to the nearest useful resource, and bonuses are provided for interacting with these resources. A larger bonus is provided for picking up the food.
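
As a rough sketch of the distance-based shaped reward described here (the scale factor and bonus magnitudes below are placeholders, not the paper's actual values):

```python
# Hypothetical sketch of the distance-based shaped reward described above.
# DIST_SCALE, INTERACT_BONUS, and FOOD_BONUS are placeholder values.
DIST_SCALE = 0.01
INTERACT_BONUS = 1.0
FOOD_BONUS = 5.0   # larger bonus for picking up the food

def shaped_reward(agent_pos, resource_positions, interacted, picked_up_food):
    # Negative scaled L1 (Manhattan) distance to the nearest useful resource.
    nearest = min(abs(agent_pos[0] - x) + abs(agent_pos[1] - y)
                  for x, y in resource_positions)
    reward = -DIST_SCALE * nearest
    if interacted:
        reward += INTERACT_BONUS
    if picked_up_food:
        reward += FOOD_BONUS
    return reward
```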

In our experiments, the trained agent solved the validation task 0% of the time.

Failure: The agent fails to interact with either resource.

Failure: The agent gets stuck in a corner.

Heat map of state visitation count across validation rollouts

Dynamic Environment: These gifs show the performance on the validation task of an agent trained in a dynamic environment in a non-episodic setting, wherein deer move at each timestep with probability 0.2. The reward used is distance-based, wherein the reward provided is the negative scaled L1 distance to the nearest useful resource, and bonuses are provided for interacting with these resources. A larger bonus is provided for picking up the food.

In our experiments, the trained agent solved the validation task 72% of the time.

Success: The agent completes the task, aided by the deer's movement into the agent's path.

Failure: The agent makes two attempts to hunt the deer, with the latter successful.

Heat map of state visitation count across validation rollouts

These visitation counts are taken from throughout the training process of the agent trained in the static environment in the non-episodic setting. This agent's experienced state distribution is concentrated in the corners and particularly low in the center of the grid, potentially inhibiting generalization of the agent.

These visitation counts are taken from throughout the training process of the agent trained in the dynamic environment (resource generation probability 0.1) in the non-episodic setting. This agent's experienced state distribution, while also highest in the corners, is more uniform than in the static environment, indicating an advantage of dynamic environments for non-episodic learning.
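
The heat maps above are tallies of how often the agent occupies each grid cell; a hypothetical sketch (not the paper's analysis code) of how such a tally, and a uniformity measure like its entropy, could be computed:

```python
import numpy as np

# Hypothetical analysis sketch: tally agent positions visited during training
# into a grid-cell heat map, and measure how uniform the resulting visitation
# distribution is via its entropy.
GRID_SIZE = 8

def visitation_heatmap(positions):
    """positions: iterable of (x, y) agent grid cells observed during training."""
    counts = np.zeros((GRID_SIZE, GRID_SIZE))
    for x, y in positions:
        counts[y, x] += 1
    return counts

def visitation_entropy(counts):
    """Entropy of the normalized visitation distribution (higher = more uniform)."""
    p = counts.flatten() / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```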

Unity Food Collector Task

Static Environment: These videos show the learned performance of agents in a static, non-episodic environment setting wherein food items are regenerated only when they are depleted, and they do not move. The reward used is sparse, wherein the reward provided is +1 when the agent consumes healthy food, -1 when the agent consumes poisonous food, and 0 at all other times.

In our experiments, the trained agent's net collected food (healthy food minus poisonous food) was ~0.2 per 1000 timesteps.
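
A minimal sketch of this sparse reward and of the net-food metric reported here (the helper names are hypothetical, not part of the Unity ML-Agents API; values follow the description above):

```python
# Hypothetical sketch of the sparse reward and the reported metric.
def sparse_reward(ate_healthy, ate_poisonous):
    if ate_healthy:
        return 1.0
    if ate_poisonous:
        return -1.0
    return 0.0

def net_food_per_1000_steps(healthy_count, poisonous_count, total_steps):
    """Net collected food (healthy minus poisonous), normalized per 1000 steps."""
    return 1000.0 * (healthy_count - poisonous_count) / total_steps
```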

Success: Due to convenient placement of the healthy food items, the agent successfully collects several of them, though it frequently takes roundabout paths to the food even when near it.

Failure: Due to the static nature of the environment and the difficult placement of food items, the agent is "blocked" for an extended period by the poisonous food item it is trying to avoid, leaving it stuck and unable to reach the healthy food.

Dynamic Environment: These videos show the learned performance of agents in a dynamic, non-episodic environment setting wherein food items are regenerated only when they are depleted, and they move at all times. The reward used is sparse, wherein the reward provided is +1 when the agent consumes healthy food, -1 when the agent consumes poisonous food, and 0 at all other times.

In our experiments, the trained agent's net collected food (healthy food minus poisonous food) was ~1.8 per 1000 timesteps.

Success: The agent successfully consumes several healthy food items while avoiding poisonous food items, including a precise and careful maneuver in a situation where a poisonous food item was adjacent to a healthy food item.

Dynamism to the rescue: Initially, there is no food in the agent's partially observed view, and the agent is unable to take meaningful action. The benefit of dynamism to learning is demonstrated when the healthy food bounces off the wall and closer to the agent, at which point the environment "unsticks" the agent.

Reset-Free Learning with Environment Shaping vs Reward Shaping

We compare the learned behavior under environment shaping and reward shaping and find that environment shaping outperforms reward shaping in the long run. Improper reward shaping can alter the optimal policy, thereby biasing learning and resulting in a solution that is worse with respect to the desired performance measure, which typically corresponds to the sparse reward. Interestingly, we find that environment shaping works better for the more difficult hunting task. As task complexity grows, so does the difficulty of constructing an unbiased shaped reward for the tasks. In this case, environment shaping benefits from its ease of use and general applicability to various tasks.
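
As background (a standard result from the reward shaping literature, not something specific to this paper): shaping terms of the potential-based form below are guaranteed not to change the optimal policy, whereas ad hoc bonuses such as a raw negative-distance term need not satisfy this form and can therefore bias the learned solution.

```latex
% Potential-based reward shaping: for any potential function \Phi over states,
% the shaped reward below leaves the optimal policy of the original MDP unchanged.
\tilde{r}(s, a, s') \;=\; r(s, a, s') \;+\; \gamma\,\Phi(s') \;-\; \Phi(s)
```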

Differences in Training between Environment Shaping and Reward Shaping

These visitation counts are taken from throughout the training process of the distance shaped agent in the episodic setting. This agent's experienced state distribution is concentrated in the corners and particularly low in the center of the grid, potentially inhibiting generalization of the agent.

These visitation counts are taken from throughout the training process of the environment shaped agent with a sparse reward in the episodic setting. This agent experiences a near-uniform distribution over grid locations. This visitation count serves as a proxy for the agent's experienced state distribution and suggests an explanation for the greater performance on the validation tasks.

Robustness of Environment Shaping over Reward Shaping

Walled Tool-Making Task

Here, the agent has the same objective as in the tool-making task, with the added difficulty of walls often blocking the shortest path to resources. This environment is designed to highlight the robustness of environment shaping compared to reward shaping, and the performance results can be seen in Fig. 7. Below, we display the state visitation counts of the agent throughout training to explain the results seen there.

These visitation counts are taken from throughout the training process of the distance shaped agent in the non-episodic setting. This agent's experienced state distribution is concentrated in the corners and particularly low in the center of the grid, potentially inhibiting generalization of the agent.

These visitation counts are taken from throughout the training process of the environment shaped agent with a sparse reward in the non-episodic setting. While this agent also spends a high proportion of its time in the corner, the distribution is markedly less concentrated than that of the distance-reward shaped agent. This environment was built to highlight the disadvantages of the distance-based reward.

Hunting Task

Environment Shaping: These gifs show the performance on the validation task of an agent trained using environment shaping, wherein all generated resources are placed in the vicinity of the agent to encourage interaction with them. The reward used is sparse: it is provided only when the agent has picked up the food and zero otherwise. The deer movement probability is set to 0.2 in all validation tasks.

In our experiments, the trained agent solved the validation task 40% of the time.

Success: The agent obtains the axe and chases the deer in two concerted efforts after the deer dodges the first.

Success: The agent tracks down the moving deer and successfully hunts it, picking up the resulting food.

Failure: The agent procures the axe, but the deer remains out of view and the agent cannot find it.

Heat map of state visitation count across validation rollouts

Reward Shaping: These gifs show the performance on the validation task of an agent trained using distance-based reward shaping, wherein the reward provided is the negative scaled L1 distance to the nearest useful resource, and bonuses are provided for interacting with these resources. A larger bonus is provided for picking up the food.

In our experiments, the trained agent solved the validation task 22% of the time.

Success: The agent makes two efforts at hunting the deer, the second successful.

Failure: The agent hovers first near the axe and then the deer, indicating a bias from the distance-based reward.

Failure: As the deer moves, the agent moves so as to remain near the midpoint of the resources.

Heat map of state visitation count across validation rollouts

Unity Food Collector Task

Environment Shaping: These gifs show the training performance of an agent trained using environment shaping, wherein all generated resources are placed in the vicinity of the agent to encourage interaction with them, and the radius in which resources are placed is increased over time, easing the agent into the fully general task. The reward used is sparse, wherein the reward provided is +1 when the agent consumes healthy food, -1 when the agent consumes poisonous food, and 0 at all other times.

In our experiments, the trained agent's net collected food (healthy food minus poisonous food) was ~3.6 per 1000 timesteps.

Success: Since food is regenerated near the agent, it repeatedly experiences the reward associated with collecting healthy or poisonous food, which provides more meaningful experience early in training and results in faster, more successful learning in the non-episodic setting. The agent can also be seen avoiding poisonous food near the end.

Success: This video uses a faster increase of the radius in which food is regenerated in order to clearly demonstrate the form of environment shaping used: the agent is gradually weaned off the support provided by the environment. Again, the agent's meaningful experiences are denser in time as a result of the environment shaping.
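
A rough sketch of this increasing-radius form of environment shaping (the schedule, constants, and function names below are hypothetical, not the actual environment code):

```python
import math
import random

# Hypothetical sketch of the environment-shaping curriculum described above:
# newly generated food is placed within a radius of the agent, and that radius
# grows over training until placement covers the whole arena.
ARENA_RADIUS = 20.0      # placeholder arena size
INITIAL_RADIUS = 3.0     # placeholder starting spawn radius

def spawn_radius(step, total_steps):
    """Linearly grow the spawn radius from INITIAL_RADIUS to the full arena."""
    frac = min(1.0, step / float(total_steps))
    return INITIAL_RADIUS + frac * (ARENA_RADIUS - INITIAL_RADIUS)

def spawn_food_near_agent(agent_xy, step, total_steps):
    """Sample a food position uniformly within the current spawn radius."""
    r = spawn_radius(step, total_steps) * math.sqrt(random.random())
    theta = random.uniform(0.0, 2.0 * math.pi)
    return (agent_xy[0] + r * math.cos(theta),
            agent_xy[1] + r * math.sin(theta))
```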