LearnPNP Tutorial - Matteo Leonetti


LearnPNP

LearnPNP is a tool for representing and learning agent behaviours based on Petri Net Plans, which in turn are based on Petri Nets. The expressive power of Petri Nets allows us to compactly represent conditional plans with loops, parallel action execution, interrupts, and partial observability.

Where to start

In the rest of this page I'll mainly cover the learning extension to PNP, and I assume the reader is already familiar with the basics of the formalism. If this is not the case, I suggest taking a look at the paper by Ziparo et al. [1] and at the first part of this tutorial. Before you proceed, make sure you are able to understand what a plan does by looking at its Petri Net. This is usually quite intuitive, and getting the hang of PNP should be pretty easy. If you are interested in a detailed description of LearnPNP, you can refer to the paper my advisor and I presented at the RoboCup International Symposium 2010 [3], available on the page of my publications.

Download and Installation

This library is written in C++, and has been developed and tested on Linux. I have not tried it on other platforms yet, but I believe it can be compiled and installed on any system that provides the basic requirements. Those include, apart of course from a C++ compiler, CMake and libxml2.

Most Linux distributions should provide these; make sure you have installed the respective development packages (usually identified by the suffix -devel) where appropriate. For instance, on my system, libxml2 comes in two packages: "libxml2" and "libxml2-devel". The first is required to use the library and the second to compile code that uses it. We shall need both, so make sure both are installed.

 

Download the PNP code from GitHub: https://github.com/iocchi/PetriNetPlans

The installation is as easy as typing the following:

cd pnp
mkdir build
cd build
cmake ../src
make install

If all the requirements are met, you should now have the directory include, with the headers we are going to use for compilation, and the directory lib, with the file libpnp.so that we are going to use for linking and execution.
You can copy both to an appropriate location, for instance /usr/include for the former and /usr/lib for the latter. You can also leave them where they are; either way, the software that uses the library must know their location. So, from now on, I assume the existence of two environment variables, PNP_INCLUDE and PNP_LIB. To set them, just add the following lines to your .bashrc file:

export PNP_INCLUDE=<path to pnp>/include
export PNP_LIB=<path to pnp>/lib

If you have trouble with linking, just copy libpnp.so into the system directory /usr/lib and that should fix it.

To generate the documentation you need to have doxygen installed; then just type:

cd <your path>/pnp
doxygen

It'll create a directory named doc with HTML files describing the library. In the main file (index.html) you can find a summary of the steps needed to implement a generic domain.

You are now ready to write your first program that uses LearnPNP. I suggest starting with the following tutorials.

Hello Grid Worlds!

Grid Worlds are easy to understand and quite popular in RL papers. For this reason I've implemented a few generic classes for grid worlds that can be specialised into particular problem instances.

You can download the code from here: gridworlds.zip

The compilation is pretty straightforward and similar to LearnPNP's:

cd <path to gridworld>
mkdir build
cd build
cmake ..
make
cd ..
doxygen

The last command generates the documentation, which again can be found in the directory doc.

The two main classes that must be sub-classed in order to implement a particular grid world are Grid and World. Their documentation, and the rest of this page, should clarify their role. The next section provides an introduction to implementing domains for LearnPNP using GridWorlds.

Parr and Russell's Grid World

gridworld

We show how to implement Parr and Russell's Grid World and reproduce the results by Loch and Singh [2] on Sarsa(lambda) in this partially observable environment.

Environment

Grid World is a small domain with eleven states, four actions, and four observations. At any given time, the agent can only perceive whether there is a wall to its immediate left and right. The actions are the usual move north, move east, move south, and move west, which move the agent in the specified direction with probability 0.8, and in one of the two orthogonal directions with probability 0.1 each. So, for instance, if the action chosen is move north, the agent will actually try to move north with probability 0.8, and will try to move east or west with probability 0.1 each. If the agent tries to move in the direction of a wall, it stays in its current state.
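The transition dynamics above can be sketched in a few lines. This is an illustrative stand-in, not LearnPNP's actual code; the function name and the direction encoding are my own:

```cpp
#include <cassert>

// Direction codes used in this sketch: 0 = north, 1 = east, 2 = south, 3 = west.
// Given the intended direction and a uniform random draw in [0,1), return the
// direction the agent actually attempts: the intended one with probability 0.8,
// each orthogonal one with probability 0.1.
int attemptedDirection(int intended, double roll) {
    if (roll < 0.8) return intended;            // the intended direction
    if (roll < 0.9) return (intended + 1) % 4;  // one orthogonal direction
    return (intended + 3) % 4;                  // the other orthogonal direction
}
```

Whether the attempted move succeeds then depends only on whether the target cell is a wall; if it is, the agent keeps its current position.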

In the following we are going to use the definition of the problem as an episodic task provided by Perkins [4]. The agent receives a reward of -0.04 at each step, except when it enters one of the two terminal states shown in the picture, where the reward is +1 and -1 respectively. The episode terminates when the agent enters a terminal state or after 100 steps, and the environment is reset to its initial state.

Implementation

Before following the usual steps for LearnPNP (most of which apply to PNP as well) we must implement our little domain. I'll take the bottom-left cell to be (0,0) and the top-right to be (3,2). So, I've just sub-classed Grid and defined the method obstacle() to return true for coordinates with x<0 or x>3, y<0 or y>2, and for x==1 and y==1. This means the agent cannot go outside the grid or into the obstacle at (1,1).
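A minimal sketch of such an obstacle() predicate, assuming it is a simple coordinate test (the real ParrGridWorld method may have a different signature):

```cpp
#include <cassert>

// Everything outside the 4x3 grid is blocked, plus the internal cell (1,1).
bool obstacle(int x, int y) {
    if (x < 0 || x > 3) return true;  // off the grid horizontally
    if (y < 0 || y > 2) return true;  // off the grid vertically
    return x == 1 && y == 1;          // the internal obstacle
}
```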

You can find parr::ParrGridWorld, and the other classes mentioned in this page, in the source code of gridworlds introduced above. Please do read the code and the documentation as you work through this tutorial.

1. The plan

The PNP executor can load Petri Nets in the Petri Net Markup Language (PNML) format. The two tools I have used that save the net in that format are jarp and pipe. You can use either one, but you won't be able to open a file saved by one with the other. In the rest of this tutorial I am going to use jarp.

Writing the plan is the first thing to do, because in doing so you define the actions and conditions required during execution. The plan for this problem is represented in the next figure and contained in the file plans/parr/grid.pnml.

 

plan

Since the plan represents a reactive policy, it first tests the four conditions LF_RF, LF_RO, LO_RF, and LO_RO, and then for each of those cases decides which of the four available actions must be executed. The four conditions encode the observations; in particular, LF_RF is the case in which the left-hand side and the right-hand side are both free of obstacles; LF_RO is the case in which the left-hand side is free and the right-hand side has an obstacle; and so on. The four actions available are ActMove_{N,E,W,S}, and a separate value will be learned for each of them (a value function is associated with the markings of the Petri Net [3]). The representation could be more compact, but for now I'll just keep it simple.

So we need those four actions and those four conditions. Let's move on to the next step.

2. The actions

I created a class parr::MoveAction to do the job of our four actions. The constructor allows you to specify the increment along both axes so that, for instance, parr::MoveAction(0,1) will create an instance of MoveAction that implements move north. MoveAction extends PnpAction and redefines the methods executeStep(), which executes one step of a possibly extended action, and finished(), which tells the executor when the action has terminated. In our case, since the action is instantaneous, finished() always returns true. Moreover, executeStep() uses Grid::move() to implement the stochastic action required by the domain.
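The shape of the class can be sketched as follows. This is a simplified, self-contained stand-in: the real MoveAction extends PnpAction and delegates the (stochastic) movement to Grid::move(), while here the step is deterministic for illustration:

```cpp
#include <cassert>

// Sketch of MoveAction: an instantaneous action parameterised by the
// increments along the two axes, e.g. (0,1) for move north.
struct MoveActionSketch {
    int dx, dy;  // increment along each axis
    MoveActionSketch(int dx_, int dy_) : dx(dx_), dy(dy_) {}

    // Instantaneous action: it is done after a single step.
    bool finished() const { return true; }

    // One step of the action; the real code calls Grid::move() here.
    void executeStep(int& x, int& y) const { x += dx; y += dy; }
};
```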

3. The Condition Checker (and Reward Collector)

This class allows the executor to check whether the conditions specified in the plan hold during execution. The expressions are formed by atomic conditions and logical operators. PNP supports the classical operators AND, OR, and NOT in prefix notation. Brackets may be used to enforce precedence among operators, and the outermost parentheses of a guard are always square brackets. So, for instance, [and ATOM1 ATOM2] would be a condition, and [or (and ATOM1 ATOM2) (and ATOM3 ATOM4)] would be another. The executor has a built-in parser that analyses the conditions at run-time. When it gets to an atomic condition it asks a ConditionChecker, which basically acts as the bridge between the executor and the perceptions. In our case, we must specify a Condition Checker that returns the value of our five atomic conditions: LF_RF, LF_RO, LO_RF, LO_RO, and FinalState.
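To make the guard syntax concrete, here is a toy evaluator for such prefix-notation expressions. It is not the executor's built-in parser: operators are treated as binary, and atoms are resolved by a caller-supplied function playing the ConditionChecker's role:

```cpp
#include <cassert>
#include <functional>
#include <sstream>
#include <string>

// Evaluates one (sub)expression read from the stream. Brackets are only
// delimiters here, so they are stripped from the tokens.
static bool evalExpr(std::istringstream& in,
                     const std::function<bool(const std::string&)>& atom) {
    std::string tok;
    in >> tok;
    while (!tok.empty() && (tok[0] == '[' || tok[0] == '(')) tok.erase(0, 1);
    while (!tok.empty() && (tok.back() == ']' || tok.back() == ')')) tok.pop_back();
    if (tok == "and") { bool a = evalExpr(in, atom); bool b = evalExpr(in, atom); return a && b; }
    if (tok == "or")  { bool a = evalExpr(in, atom); bool b = evalExpr(in, atom); return a || b; }
    if (tok == "not") { return !evalExpr(in, atom); }
    return atom(tok);  // an atomic condition: ask the "condition checker"
}

// Evaluates a whole guard such as "[or (and ATOM1 ATOM2) (and ATOM3 ATOM4)]".
bool evalGuard(const std::string& guard,
               const std::function<bool(const std::string&)>& atom) {
    std::istringstream in(guard);
    return evalExpr(in, atom);
}
```

Note that both operands of and/or are evaluated unconditionally: short-circuiting would leave the second operand's tokens unconsumed and break the parse.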

The condition checker for this domain is parr::ParrReward. It extends learnpnp::RewardCollector which in turn extends PetriNetPlans::ExternalConditionChecker. Therefore, it works as a normal Condition Checker (see the method evaluateAtomicCondition()) for a PnpPlan and also provides the method reward() used by a LearnPlan.

The method ParrReward::reward() is called by the plan every time the marking changes. Not every change in the marking corresponds to an action in the environment, so reward() returns 0 most of the time. When an action has actually happened, though, it returns 1 or -1 if the agent is in one of the final states, or -0.04 otherwise.

Since this is the interface between the executor and the representation of the environment (or its perceptions), it must have access to such a representation. Since our case is particularly simple, it just holds a reference to the object that implements the grid.
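The reward scheme just described can be summarised in a small sketch. The names are illustrative, and the real ParrReward decides internally whether an action actually happened since the last marking change:

```cpp
#include <cassert>

// Reward for one marking change, following Perkins' episodic formulation:
// 0 if no action occurred, +1/-1 on entering a terminal state, -0.04 otherwise.
double stepReward(bool actionHappened, bool atGoal, bool atTrap) {
    if (!actionHappened) return 0.0;  // marking changed without acting
    if (atGoal) return 1.0;           // the +1 terminal state
    if (atTrap) return -1.0;          // the -1 terminal state
    return -0.04;                     // per-step cost
}
```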

4. The Instantiator

This class is responsible for creating the objects that represent plans and actions. In my experience this class tends to grow ugly quite fast, so I've separated it into two parts: BaseInstantiator and parr::ParrInstantiator. The core method to implement is BaseInstantiator::createExecutable(). It is called by the executor with the name of the action or plan to be created, as written in the calling plan. In our case, for instance, it'll be called with the parameter equal to "ActMove_N" when such an action must be started.

The executor does not distinguish between plans and actions, so they are generically returned as PnpExecutables. Since the actions depend on the particular domain, the method BaseInstantiator::createExecutable() delegates their creation to the method createAction(), which is defined by ParrInstantiator. For the same reason, it declares the method createController() to return the learning algorithm.

The general implementation of createExecutable() will check whether the required executable is an action and, if so, return it; if it is a plan, it will create a controller for the plan, load the plan from its pnml file, and return it.
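That control flow can be sketched like this. The name-based dispatch and the string return value are simplifications of the real method, which returns PnpExecutable objects:

```cpp
#include <cassert>
#include <string>

// Illustrative dispatch of createExecutable(): action names are handed to the
// domain-specific factory, anything else is treated as a plan to be loaded.
std::string createExecutableSketch(const std::string& name) {
    bool isAction = name.rfind("ActMove_", 0) == 0;  // our four actions
    if (isAction)
        return "action:" + name;         // createAction(name) in the real code
    // otherwise: create a controller, load <name>.pnml, return a LearnPlan
    return "plan:" + name + ".pnml";
}
```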

The methods ParrInstantiator::createAction() and createController() allow us to tailor this general behaviour to our domain. For this tutorial, ParrInstantiator uses a simple LoggingController, composing a learning algorithm (TDLambda) and an exploration policy (ActionEGreedy). ActionEGreedy is an epsilon-greedy strategy where epsilon starts at a specified value and decreases to zero over the specified number of actions. A LoggingController can be configured to log to a file the values of particular markings at the end of each episode, which will be useful to understand how the value function changes during a run.

5. The executor

The last step is the instantiation of the executor. This can be found in the file main.cpp.

To recap

We have implemented our domain with four classes: ParrGrid, to specify the topology of the grid; MoveAction, which can execute all the actions available to the agent; ParrReward, to act as the interface between the executor and the environment, returning the value of conditions and the reward accumulated since the last marking change; and ParrInstantiator, to load the plans and instantiate the actions. After loading a plan from its PNML file, ParrInstantiator also chooses the controller that determines the learning algorithm used.

Let's give it a go

The file parameters.h defines a few numbers we need to set before we start.

The typedef is just a way to choose the domain among those available; through it, the proper subclass of World will be used to locate the files. LogPlaces should be set to true, since it determines whether the logger actually creates the log files. This feature can be turned off to speed up execution.

An epoch is a period during which learning and evaluation alternate. Their lengths are regulated by learningPeriods and numSamples respectively. The total number of learning episodes (this doesn't take evaluation into account) can be set through numepisodes, and the number of epochs through numepochs.

We are going to run 20 epochs of 2000 learning episodes each, in which every 200 episodes learning is paused and the current policy is evaluated for 100 episodes.
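For reference, a hedged sketch of what these settings could look like in parameters.h. The identifiers follow the tutorial text, but check your copy of the file for the exact names and values:

```cpp
// Illustrative settings matching the run described above.
const bool LogPlaces = true;      // enable the marking logger
const int numepochs = 20;         // independent learning runs
const int numepisodes = 2000;     // learning episodes per epoch
const int learningPeriods = 200;  // learning episodes between evaluations
const int numSamples = 100;       // evaluation episodes at each pause
```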

Let's compile and run gridworlds:

cd gridworlds/build
cmake ..
make
./gridworlds

During the execution it'll print out the percentage of its progress and the number of actions performed so far in that epoch. In the directory plans/parr/ a few files will be created.

grid_.txt contains the value function. You can see the id, value, and marking of each state stored. It is initialised by copying the file specified in World::vInitFile(), which by default is the name of the plan plus "-init". I have provided a few of these files with various initial values; for instance, grid-init-4.txt sets every value to -4. Just rename (or copy) the one you want to use to grid-init.txt.

places_<epoch>.txt is the output of the logger. It has a line for each episode of that epoch during which learning was active, and a column for each marking specified in the code (refer to ParrInstantiator::setLogger()).

reward-parr-<timestamp>_<epoch>.txt has a line for each episode in which learning was not active, that is, for the evaluation of the policies. Since we decided to evaluate our policies for 100 episodes (see the file parameters.h), this file must be interpreted in blocks of 100.

Plotting results

To obtain the average value of our policies at the points in which we evaluated them, we must average over all the reward-parr* files, considering the lines in groups of 100. I've written an R script that does so; install R and just run:

R
> source("computeMeans.txt")
> q()
> n

where ">" denotes R's prompt. If you change the number of samples for each evaluation, edit the file computeMeans.txt, changing the first variable accordingly.

In the directory plans/parr it'll create a file means.txt with a line for each evaluation point. The first evaluation is made before any learning takes place, and then one every 200 episodes, so for 10000 episodes there are 51 lines (10000/200 + 1, where the 1 is the evaluation at the beginning).

To plot this file you can use any spreadsheet or similar tool; I'll use gnuplot. The following commands, again in a shell in the directory plans/parr/,

gnuplot
> set key bottom right
> plot "means.txt" using (($1-1)*200):3:4 title "reward" with yerrorlines

should produce something like this graph:

meansEpisodes

while if you use the second column for the x axis, it plots the reward against the average number of actions:

> plot "means.txt" using 2:3:4 title "reward" with yerrorlines

meansActions

Notice the different scale on the x axis. The graph is not particularly smooth, but it would probably take more than 20 epochs to make it so.

To get an idea of what goes on at choice points during one particular epoch, we can use the places* files that the logger has created. Before we can do that, however, we need to know which column belongs to which action.

The way to find this out is a bit tedious: we must dig into the pnml file describing the plan and read the ids of the places we are interested in. This is what I did to decide which places to log and to put them in ParrInstantiator::setLogger().


The following table shows the execution places (in PNP terms) associated with each action for each of the four observations:

                                 north   west   south   east
  left occupied, right free      p11     p12    p13     p14
  left free, right free          p3      p4     p5      p6
  left free, right occupied      p17     p18    p19     p20
  left occupied, right occupied  p23     p24    p25     p26



So, if we want to know what happened, for instance, under the initial observation (which does not correspond only to the initial state), we can plot the places of the first row.

Using gnuplot on the first epoch:

gnuplot
> set key bottom right
> plot "places_1.txt" using 0:2 title "north" with lines, '' u 0:3 t "west"  w l,'' u 0:4 t "south"  w l, '' u 0:5 t "east"  w l

The result should be something like this:

value function for the actions available at a certain observation

In this particular run, the action move east has the highest value until approximately the 3000th episode, when move north correctly overtakes it. At about episode 8000, the epsilon of our epsilon-greedy strategy reaches zero, and from there on only the action with the highest value is executed.

By changing the columns plotted from the file places_1.txt you can look at the choice points for the other observations. Just open the file, read the order of the columns, and put them in the command for gnuplot. For instance, say we want to plot the values for the observation in which left and right are both free of obstacles.
We look at our table and see that we need the places from p3 to p6. Then we look inside the file places_1.txt and see that those places correspond to columns 14 to 17. We modify the command for gnuplot accordingly:

> plot "places_1.txt" using 0:14 title "north" with lines, '' u 0:15 t "west"  w l,'' u 0:16 t "south"  w l, '' u 0:17 t "east"  w l

which gives something like:

Another choice point

where we can see that move east is correctly the action chosen by the final policy.

As the last plot, let's look at the value of p1 during the first epoch. A token in p1 constitutes the initial marking of the Petri Net, so p1's value corresponds to the estimate of the total reward. Let's plot it:

> plot "places_1.txt" using 0:1 title "estimated reward" with lines

The value function in the initial marking

As we can see, it converges to about 0.25, which is the value of the optimal policy.

References

[1] V. A. Ziparo, L. Iocchi, D. Nardi, P. F. Palamara, and H. Costelha.
Petri net plans: a formal model for representation and execution of multi-robot plans.
In AAMAS ’08: Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems, pages 79–86, Richland, SC, 2008. International Foundation for Autonomous Agents and Multiagent Systems.

[2] J. Loch and S. Singh.
Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes.
In Proceedings of the Fifteenth International Conference on Machine Learning, pages 323-331, 1998.

[3] M. Leonetti and L. Iocchi.
LearnPNP: A tool for learning agent behaviors.
In Proceedings of the RoboCup International Symposium 2010, Springer, 2010.

[4] T. J. Perkins.
Reinforcement learning for POMDPs based on action values and stochastic optimization.
In Proceedings of the National Conference on Artificial Intelligence, pages 199-204. AAAI Press / MIT Press, 2002.