Lab 12: Markov Decision Processes

In this lab you will practice value iteration for solving MDPs and explore the impact of different parameters on the resulting optimal policy. This lab is largely based on the Berkeley AI Pacman projects (Project 3: Reinforcement Learning, Q1 and Q3).

1. First you will familiarize yourself with the simple grid world MDP. Make sure you have the latest lab code under lab12/. The directory contains many files, but you only need to read or edit the ones that are mentioned explicitly in the lab description. The file gridworld.py implements a parametrized grid world in which a dot "robot" can navigate. The state of the world is the robot's current pose. At each time step, the robot takes an action that results in a new (or the same) world state. The robot's actions, which are attempts to move in one of four directions, are stochastic; i.e. they may not always have the intended effect. After each action the robot receives a reward (which could be zero or negative). Run the following command to see all the options and parameters of the MDP implementation.

python3 gridworld.py -h

Next run it in the manual mode and move the robot around using the keyboard arrows.

python3 gridworld.py -m

Finally, run a default agent that selects its next action randomly in a maze-like grid and observe it. You should see the random agent bounce around the grid until it happens upon an exit. Repeat this a couple of times after the agent terminates.

python3 gridworld.py -g MazeGrid

The Gridworld MDP is such that you first must enter a pre-terminal state (the double boxes shown in the GUI) and then take the special 'exit' action before the episode actually ends (in the true terminal state called TERMINAL_STATE, which is not shown in the GUI). If you run an episode manually, your total return may be less than you expected, due to the discount rate (use option -d to change from the default value of 0.9).
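As a quick hedged illustration of how the discount shrinks the return, consider a hypothetical episode with four zero-reward moves followed by the 'exit' action worth +1, and assume the reward for the t-th action is weighted by 0.9**t with t starting at 0 (check gridworld.py for the exact convention it uses when printing the return):

# Hypothetical episode: four zero-reward moves, then the 'exit' action (+1).
rewards = [0, 0, 0, 0, 1]
discount = 0.9
ret = sum(discount**t * r for t, r in enumerate(rewards))
print(ret)   # 0.9**4 = 0.6561, noticeably less than the raw +1 exit reward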

Look at the console output that accompanies the graphical output (or use -t for all text). You will be told about each transition the agent experiences (to turn this off, use -q).

2. Next write a value iteration agent in ValueIterationAgent, which has been partially specified in valueIterationAgents.py. Your agent will take an MDP on construction and run value iteration for the specified number of iterations before the constructor returns. Note that this agent is an offline planner (not a reinforcement learning agent). The number of iterations to be run can be specified using option -i. Value iteration computes k-step estimates of the optimal values of states, Vk. In addition to computing these values, implement the following methods for ValueIterationAgent using Vk.

  • computeActionFromValues(state) computes the best action according to the value function given by self.values.
  • computeQValueFromValues(state, action) returns the Q-value of the (state, action) pair under the value function given by self.values (see the sketch below).

These quantities are all displayed in the GUI: values are numbers in squares, Q-values are numbers in square quarters, and policies are arrows out from each square indicating the action to be taken by the agent in that state.
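The sketch below shows one way these two methods can be written as a one-step lookahead on self.values, i.e. Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + discount * V(s')]. It assumes the MDP object exposes getPossibleActions, getTransitionStatesAndProbs, and getReward, and that the provided constructor stores self.mdp, self.discount, and self.values (a util.Counter); verify the exact names in mdp.py and valueIterationAgents.py before relying on them.

# A minimal sketch, not the required implementation. Method names on self.mdp
# and the attributes self.discount / self.values are assumptions based on the
# provided skeleton; check mdp.py and valueIterationAgents.py to confirm them.
def computeQValueFromValues(self, state, action):
    qValue = 0.0
    for nextState, prob in self.mdp.getTransitionStatesAndProbs(state, action):
        reward = self.mdp.getReward(state, action, nextState)
        # Expected immediate reward plus the discounted value of the successor.
        qValue += prob * (reward + self.discount * self.values[nextState])
    return qValue

def computeActionFromValues(self, state):
    actions = self.mdp.getPossibleActions(state)
    if not actions:
        return None  # e.g. the terminal state: no action, no future reward
    # The greedy action under the current value estimates.
    return max(actions, key=lambda a: self.computeQValueFromValues(state, a))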

You should use the "batch" version of value iteration, where each vector Vk is computed from a fixed vector Vk-1 (as in lecture), not the "online" version, where a single value vector is updated in place. This means that when a state's value is updated in iteration k based on the values of its successor states, the successor-state values used in the update should be those from iteration k-1 (even if some of the successor states have already been updated in iteration k). The difference is discussed in Sutton & Barto in the 6th paragraph of chapter 4.1.
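As a concrete illustration of the batch update, here is a sketch of the iteration loop under the same assumptions as the previous sketch (plus the assumptions that util is imported in the file, that the constructor stored the iteration count in self.iterations, and that the MDP exposes getStates and isTerminal); depending on your skeleton, this loop may instead belong directly in the constructor.

# Sketch of the batch ("two vector") update, under the stated assumptions.
def runValueIteration(self):
    for _ in range(self.iterations):
        newValues = util.Counter()          # V_k, built from scratch
        for state in self.mdp.getStates():
            actions = self.mdp.getPossibleActions(state)
            if self.mdp.isTerminal(state) or not actions:
                continue                    # value stays 0: no future rewards
            # Bellman backup that reads only the old values V_{k-1}, because
            # computeQValueFromValues still looks at self.values.
            newValues[state] = max(self.computeQValueFromValues(state, a)
                                   for a in actions)
        self.values = newValues             # swap in V_k once the sweep is done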

A policy synthesized from values of depth k (which reflect the next k rewards) will actually reflect the next k+1 rewards; that is, you should return the synthesized policy πk+1. Similarly, the Q-values reflect one more reward than the values (i.e. you return Qk+1).

Hint: Use the util.Counter class in util.py, which is a dictionary with a default value of zero. Methods such as totalCount should simplify your code. However, be careful with argMax: the actual argmax you want may be a key not in the counter!
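A quick illustration (assuming util.Counter behaves as described, i.e. a dictionary whose missing keys default to zero):

import util

counts = util.Counter()
counts['a'] += 1.0           # missing keys default to 0, so += works directly
counts['b'] = 2.0
print(counts['never-set'])   # 0 -- no KeyError for unseen keys
print(counts.totalCount())   # 3.0, the sum of all stored values
# Caveat: argMax only considers keys already stored in the counter, so a key
# you never inserted (implicitly worth 0) can never be returned by it.
print(counts.argMax())       # 'b'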

Make sure to handle the case when a state has no available actions in an MDP (think about what this means for future rewards).

The following command loads your ValueIterationAgent, which will compute a policy and execute it 10 times. Press a key to cycle through values, Q-values, and the simulation. You should find that the value of the start state (V(start), which you can read off of the GUI) and the empirical resulting average reward (printed after the 10 rounds of execution finish) are quite close.

python3 gridworld.py -a value -i 100 -k 10

Hint: On the default BookGrid, running value iteration for 5 iterations should give you the output shown in Figure 1:

python3 gridworld.py -a value -i 5

To test your implementation, run the autograder:

python3 autograder.py -q q2

Figure 1: Correct values computed on the BookGrid after five iterations.

3. Lastly, you will explore the impact of different MDP parameters on the policies that result from value iteration. Consider the DiscountGrid layout, shown in Figure 2. This grid has two terminal states with positive payoff (in the middle row): a close exit with payoff +1 and a distant exit with payoff +10. The bottom row of the grid consists of terminal states with negative payoff (shown in red); each state in this "cliff" region has payoff -10. The starting state is the yellow square.

We distinguish between two types of paths: (1) paths that "risk the cliff" and travel near the bottom row of the grid; these paths are shorter but risk earning a large negative payoff, and are represented by the red arrow in Figure 2. (2) Paths that "avoid the cliff" and travel along the top edge of the grid; these paths are longer but are less likely to incur huge negative payoffs, and are represented by the green arrow in Figure 2.

You will choose settings of the discount, noise, and living reward parameters for this MDP to produce optimal policies of several different types. Your setting of the parameter values for each part should have the property that, if your agent followed its optimal policy without being subject to any noise, it would exhibit the given behavior. If a particular behavior is not achieved for any setting of the parameters, assert that the policy is impossible by returning the string 'NOT POSSIBLE'.

Here are the optimal policy types you should attempt to produce:

3a. Prefer the close exit (+1), risking the cliff (-10)

3b. Prefer the close exit (+1), but avoiding the cliff (-10)

3c. Prefer the distant exit (+10), risking the cliff (-10)

3d. Prefer the distant exit (+10), avoiding the cliff (-10)

3e. Avoid both exits and the cliff (so an episode should never terminate)

In analysis.py, the functions question3a() through question3e() should each return a 3-item tuple of (discount, noise, living reward).
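The sketch below shows the expected shape of one such answer; the numbers are placeholders only, not a correct setting:

def question3a():
    # Placeholder values only -- choose parameters that actually produce the
    # behavior described in 3a. If you conclude that no setting of the three
    # parameters can produce it, return the string 'NOT POSSIBLE' instead.
    answerDiscount = 0.9
    answerNoise = 0.2
    answerLivingReward = 0.0
    return answerDiscount, answerNoise, answerLivingReward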

To check your answers, run the autograder:

python3 autograder.py -q q3

You can check your policies in the GUI. For example, using a correct answer to 3(a), the arrow in (0,1) should point east, the arrow in (1,1) should also point east, and the arrow in (2,1) should point north. On some machines you may not see an arrow; in that case, press a key to switch to the Q-value display and work out the policy by taking the argmax of the available Q-values for each state.

Figure 2: Two different classes of agent behaviors displayed on the DiscountGrid.