Protocol

This page describes the RL-Glue protocol, both abstractly and at the technical level.

The RL-Glue is kept simple so as to maximize the ease with which environments and agents can be written. The Glue consists of a small number of routines that must be defined, plus optional routines that provide additional functionality or convenience. Given these, the Glue provides a further set of routines that can be used to write general experiment programs.

The RL-Glue presented here assumes that only one agent and one environment exist at a time. (Although it would be natural to generalize this to multiple agents and multiple environments in an object-oriented fashion, that was not done in this version so as to maximize simplicity and language independence.) A user brings together three things: a learning agent, an environment, and an experiment or test that they would like to run on their combination. Each is compiled and then combined into one executable, which the user then runs, perhaps several times with parameter variations. The Glue does not make strong assumptions about the experiment program; it just provides routines for interconnecting the agent and environment, leaving it up to the experiment writer how to use them.

Typically, an experiment program involves averaging over a sequence of independent runs, each starting with a naive (before learning) agent and proceeding through a number of episodes or a single long episode. A performance measure is computed for each run (e.g., the average reward per episode over the final 100 episodes) and then averaged over runs to produce an overall performance measure for this agent-environment combination. An informal example is given below. The agent and environment must be such that each run is completely independent of the others. In particular, the agent cannot in any way use experience from earlier runs to influence its performance on later runs. The agent should define agent_init in such a way that this is true.

In what follows, we use the term "observation" for the information returned by the environment on each time step. An important special case is that in which the observation is the state of the environment. The general case, which we treat here, includes partially-observable Markov decision processes.

Episodic and continuing tasks

An episodic task is one in which the agent-environment interaction is divided into a sequence of trials, or episodes. Each episode starts in the same state, or in a state chosen from the same distribution, and ends when the environment reaches a terminal state, which it signals by returning a special terminal observation. The environment must not retain any state from episode to episode---it must generate observations with the same probability distribution on every episode with the same history of observations and actions (since the beginning of the episode). The agent, on the other hand, is normally expected to change state across episodes via its learning process.

Formally, an episodic environment is any environment that might generate the terminal observation. A terminal agent is an agent that can respond appropriately to the terminal observation (by implementing agent_start and agent_end).

Formally, a continuing task is one in which there is one episode that starts once and goes on forever.

Environment routines

Every environment (plant, simulator) must implement the following two routines.

env_start() --> first_observation  

For a continuing task, this is done once. For an episodic task, this is done at the beginning of each episode. Note that no reward is returned. In the case of an episodic environment, end-of-episode is signaled by a special observation. This special observation cannot be returned by env_start.

env_step(action) --> reward, observation

Do one time step of the environment.
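
To make the calling pattern concrete, here is a minimal sketch of an environment written as ordinary Python. Everything specific to it (a five-state corridor, the actions -1 and +1, the string 'terminal' used as the terminal observation) is illustrative only and is not prescribed by the protocol.

     TERMINAL = 'terminal'          # the designated terminal observation
     position = None                # current place in a 5-state corridor, 0..4

     def env_start():
          global position
          position = 2              # every episode starts in the middle
          return position           # first_observation; never the terminal observation

     def env_step(action):
          global position
          position = position + action        # action is -1 (left) or +1 (right)
          if position < 0:
               return 0, TERMINAL             # walked off the left end: reward 0, episode over
          if position > 4:
               return 1, TERMINAL             # reached the right end: reward 1, episode over
          return 0, position                  # ordinary step: reward 0, next observation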

No other functionality is required from the environment. The routines described below are optional and need only be implemented if the environment writer finds them convenient or desires the additional functionality.

env_init() --> task_specification

This routine will be called exactly once. It can be used to initialize the environment and/or to provide, as its returned value, a specification of its i/o interface - the space of actions that the environment accepts and the space of rewards and observations it returns. The task_specification is optional and will be made available to the agent via the routine agent_init. See the proposal for a task specification language.
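
Continuing the corridor sketch above, env_init might look as follows. The returned string is only a stand-in; a real environment would use the task specification language referred to above.

     def env_init():
          # Placeholder description of the i/o interface; not the proposed
          # task specification language.
          return "actions {-1,+1}, observations 0..4, rewards {0,1}"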

env_get_state() --> state_key

Saves the current state of the environment such that it can be recreated later upon presentation of state_key. The state_key could in fact be the state object, but returning just a key (a logical pointer to the state information) saves passing the state back and forth and avoids giving the agent direct access to the state.

env_set_state(state_key)

Restores the environment to the state it was in when state_key was obtained. Generates an error if state_key was not previously generated by env_get_state with this environment.

env_get_random_seed() --> random_seed_key

Saves the random seed object used by the environment such that it can be restored upon presentation of random_seed_key. Same comments as above for env_get_state.

env_set_random_seed(random_seed_key)

Restores the random seed used by the environment such that the environment will behave exactly the same way it has previously when it was in this state and given the same actions. Typically used in conjunction with env_set_state. Generates an error if random_seed_key was not previously generated by env_get_random_seed with this environment.
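
For example, assuming an environment that implements all four of these routines, an experiment might try an action and then rewind, as in the following sketch (the variable names are illustrative):

     a = +1                                   # some action to try
     state_key = env_get_state()              # remember the current situation
     seed_key = env_get_random_seed()         # and how its randomness will unfold
     r1, o1 = env_step(a)                     # try the action
     env_set_state(state_key)                 # put the environment back
     env_set_random_seed(seed_key)            # and make it behave identically
     r2, o2 = env_step(a)                     # the same action now gives r2 == r1, o2 == o1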

env_cleanup()

This routine is called once for each call to env_init. env_init may allocate memory or other resources that will be released by this routine.

Agent routines

Every agent (controller) must implement the following two routines.

agent_start(first_observation) --> first_action

Do the first step of a run or episode. Note that there is no reward input.

agent_step(reward,observation) --> action

Do one step of the agent.

If an agent is to be used with episodic environments (environments that return terminal observations) then it must implement the following routine.

agent_end(reward)

Do the final step of the episode.

If multiple runs will be made with the agent, then it must be returned to its initial pre-learning state prior to each run. The following routine is called at the beginning of each run.

agent_init(task_specification)

Initializes the agent to a naive (pre-learning) state. The task_specification, if given, is a description of the environment's i/o interface according to a standard description language. The agent may ignore the task_specification.

If memory or other resources are allocated when the agent is initialized, then the following routine can be used to release them.

agent_cleanup()
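
As with the environment, a small concrete agent may help fix ideas. The following sketch is a tabular epsilon-greedy one-step Q-learning agent written for the corridor environment sketched earlier; the learning rule, the constants, and the assumption that observations are the integers 0 through 4 are illustrative choices, not part of the protocol.

     import random

     ACTIONS = [-1, +1]
     EPSILON = 0.1                  # exploration rate (illustrative)
     ALPHA = 0.5                    # step size (illustrative)
     Q = {}                         # action-value table
     last_o = None                  # observation on the previous step
     last_a = None                  # action taken on the previous step

     def agent_init(task_specification):
          # Return to a naive, pre-learning state; the task_specification is ignored here.
          global Q, last_o, last_a
          Q = {(o, a): 0.0 for o in range(5) for a in ACTIONS}
          last_o, last_a = None, None

     def choose(observation):
          if random.random() < EPSILON:
               return random.choice(ACTIONS)
          return max(ACTIONS, key=lambda a: Q[(observation, a)])

     def agent_start(first_observation):
          global last_o, last_a
          last_o, last_a = first_observation, choose(first_observation)
          return last_a

     def agent_step(reward, observation):
          global last_o, last_a
          target = reward + max(Q[(observation, a)] for a in ACTIONS)
          Q[(last_o, last_a)] += ALPHA * (target - Q[(last_o, last_a)])
          last_o, last_a = observation, choose(observation)
          return last_a

     def agent_end(reward):
          # Final update of the episode; there is no next observation.
          Q[(last_o, last_a)] += ALPHA * (reward - Q[(last_o, last_a)])

     def agent_cleanup():
          pass                      # nothing to release in this sketch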

Interface routines provided by the RL-Glue

Experiment writers will typically access the RL-Glue entirely through the interface routines described in this section. These routines are not meant to be changed by users; they are a permanent, defining part of the RL-Glue. They are implemented by appropriate calls to the agent and environment routines described in the preceding sections.

The interface routines can be used to write a variety of specific experiment programs, examples of which were noted earlier. Python-like pseudocode is given here for each routine to suggest its specific relationship to the agent and environment routines. To understand the following, it is helpful to think of an episode as consisting of observations, actions, and rewards that are time-step indexed as follows:

o0, a0, r1, o1, a1, r2, o2, a2, ..., rT, terminal_observation

where the episode lasts T time steps (T may be infinite) and terminal_observation is a special, designated observation signaling the end of the episode.

RL_init()  
     agent_init(env_init())

Initialize everything, passing the environment's i/o specification to the agent.

RL_start() --> o0, a0

     global upcoming_action
     s = env_start()
     a = agent_start(s)
     upcoming_action = a
     return s,a

Do the first step of a run or episode. The action is saved in upcoming_action so that it can be used on the next step.

RL_step() --> rt, ot, at

     global upcoming_action
     r, s = env_step(upcoming_action)
     if s == terminal_observation:
          agent_end(r)
          return r, s
     else:
          a = agent_step(r, s)
          upcoming_action = a
          return r, s, a

Do one time step. RL_step uses the saved action and saves the returned action for the next step. The action returned from one call must be used in the next, so it is better handled implicitly here, sparing the user from keeping track of the action. If the end-of-episode observation occurs, then no action is returned.
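
For instance, an experiment for a continuing task, in which the terminal observation never occurs, might drive the interaction directly with these routines (the step count of 1000 is arbitrary):

     RL_init()
     RL_start()
     for t in range(1000):
          # in a continuing task RL_step never sees the terminal observation,
          # so it always returns a reward, an observation, and an action
          r, o, a = RL_step()
     RL_cleanup()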

RL_episode(steps) --> o0, a0, r1, o1, a1, ..., rT

or o0, a0, r1, o1, a1, ..., rsteps, osteps, asteps

     s, a = RL_start()
     list = [s, a]
     while s != terminal_observation:
          r, s, a = RL_step()
          list = list + [r, s, a]
     return list minus last two elements

Do one episode until termination or until steps steps have elapsed, whichever comes first. As you might imagine, this is done by calling RL_start, then RL_step until the terminal observation occurs. The pseudocode shown is specific to the case in which the episode is completed in less than steps steps.
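
A sketch that also enforces the steps cutoff, and that allows for the final call to RL_step returning only the reward and the terminal observation, might look like the following; it is one possible rendering, not the definitive implementation:

     def RL_episode(steps):
          seq = []
          o, a = RL_start()
          seq += [o, a]
          for t in range(steps):
               result = RL_step()             # (r, o) on the terminal step, (r, o, a) otherwise
               o = result[1]
               if o == terminal_observation:  # the designated special observation
                    seq += [result[0]]        # keep rT; drop the terminal observation itself
                    break
               seq += list(result)
          return seq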

RL_return() --> return

Return the cumulative total reward (r1 + r2 + ... + rT) of the current or just-completed episode. Any discounting must be done inside the environment.

RL_num_steps() --> num_steps

Return the number of steps elapsed in the current or just completed episode.

RL_cleanup()
     env_cleanup()
     agent_cleanup()

Provides an opportunity to reclaim resources allocated by RL_init.

Experiment programs

Given all of the above, users will write experiment programs that produce clear performance measures, perhaps something like the average return over 1000 episodes, averaged again over 100 runs:

RL_experiment() --> performance
     performance = 0
     for run = 1..100
          RL_init()
          sum = 0
          for episode = 1..1000
               RL_episode(10000000)
               sum = sum + RL_return()
          performance = performance + sum/1000.0
          RL_cleanup()
     return performance/100.0

The idea is to provide one overall measure of performance defining the experiment.