
Lab 13: Reinforcement Learning

In this lab you will practice Q-learning on the same problem domain as last week's lab and explore sources of stochasticity on the Cozmo robot. This lab is largely based on the Berkeley AI Pacman projects (Project 3: Reinforcement Learning, Q4 & Q5).

1. For this lab you can work under the lab12/ folder, which already contains many of the files you need, but be sure to pull the additional files for this lab. In the last lab, you implemented a value iteration agent that does not actually learn from experience. Rather, it ponders its known MDP model to arrive at a complete policy before ever interacting with a real environment. When it does interact with the environment, it simply follows the precomputed policy (i.e., it becomes a reflex agent). This distinction may be subtle in a simulated environment like Gridworld, but it is very important in the real world, where the true MDP is not available.

In this lab, you will write a Q-learning agent, which does very little on construction, but instead learns by trial and error from interactions with the environment through its update(state, action, nextState, reward) method. A stub of a Q-learner is specified in QLearningAgent in qlearningAgents.py, and you can select it with the option '-a q'. For this question, you must implement the update, computeValueFromQValues, getQValue, and computeActionFromQValues methods.
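
As a concrete starting point, here is a minimal sketch of getQValue and update. It assumes the stub stores Q-values in a util.Counter created in __init__ (e.g. self.qvalues = util.Counter()) and that, as in the standard Berkeley stub, self.alpha, self.discount, and self.getLegalActions(state) are available; adjust the names to match your own code.

def getQValue(self, state, action):
    # util.Counter returns 0.0 for keys it has never seen, which gives
    # unvisited (state, action) pairs the required default Q-value of zero.
    return self.qvalues[(state, action)]

def update(self, state, action, nextState, reward):
    # Standard Q-learning update:
    #   Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + discount * max_a' Q(s',a'))
    nextActions = self.getLegalActions(nextState)
    nextValue = max([self.getQValue(nextState, a) for a in nextActions]) if nextActions else 0.0
    sample = reward + self.discount * nextValue
    self.qvalues[(state, action)] = ((1 - self.alpha) * self.getQValue(state, action)
                                     + self.alpha * sample)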

For computeActionFromQValues, you should break ties randomly for better behavior; the random.choice() function will help. In a particular state, actions that your agent has not seen before still have a Q-value, specifically a Q-value of zero, and if all of the actions that your agent has seen before have a negative Q-value, an unseen action may be optimal. Your computeValueFromQValues and computeActionFromQValues functions should access Q-values only by calling getQValue.
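
Under the same assumptions, computeValueFromQValues and computeActionFromQValues might look like the sketch below. Note that both go through getQValue, and that ties among equally good actions are broken with random.choice.

import random  # if not already imported at the top of qlearningAgents.py

def computeValueFromQValues(self, state):
    # The value of a state is its best Q-value over the legal actions;
    # terminal states (no legal actions) have value 0.0.
    actions = self.getLegalActions(state)
    if not actions:
        return 0.0
    return max(self.getQValue(state, a) for a in actions)

def computeActionFromQValues(self, state):
    # Return a best action, breaking ties uniformly at random.
    actions = self.getLegalActions(state)
    if not actions:
        return None
    bestValue = self.computeValueFromQValues(state)
    bestActions = [a for a in actions if self.getQValue(state, a) == bestValue]
    return random.choice(bestActions)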

With the Q-learning update in place, you can watch your Q-learner learn under manual control, using the keyboard:

python gridworld.py -a q -k 5 -m

The option -k controls the number of episodes your agent gets to learn from. Watch how the agent learns about the state it was just in, not the one it moves to, and "leaves learning in its wake." Hint: to help with debugging, you can turn off noise with the --noise 0.0 parameter (though this obviously makes Q-learning less interesting). If you manually steer the agent north and then east along the optimal path for four episodes, you should see the Q-values shown in Figure 1.
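
To make the "wake" concrete: with noise off and the standard defaults (learning rate alpha = 0.5, discount 0.9, living reward 0; check your copy of gridworld.py if you have changed them), the only nonzero update during the first episode is for the exit action taken from the square marked +1:

Q(s, exit) <- (1 - 0.5) * 0 + 0.5 * (1 + 0.9 * 0) = 0.5

Every earlier state on the path keeps all of its Q-values at zero, because those updates were made before any positive value existed downstream; each additional episode propagates the value back by one more step.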

We will run your Q-learning agent and check that it learns the same Q-values and policy as our reference implementation when each is presented with the same set of examples. To grade your implementation, run the autograder:

python autograder.py -q q4

Figure 1: Expected Q-values after a few manually controlled actions.

2. Complete your Q-learning agent by implementing epsilon-greedy action selection in getAction, meaning it chooses a random action an epsilon fraction of the time and follows its current best Q-values otherwise. Note that choosing a random action may still result in choosing the best action; that is, you should not choose a random sub-optimal action, but rather any random legal action. Then train your agent for 100 episodes:

python gridworld.py -a q -k 100 

Your final Q-values should resemble those of your value iteration agent, especially along well-traveled paths. However, your average returns will be lower than the Q-values predict because of the random actions and the initial learning phase.

You can choose an element from a list uniformly at random by calling the random.choice function. You can simulate a binary variable with probability p of success by using util.flipCoin(p), which returns True with probability p and False with probability 1-p.
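
Putting these pieces together, an epsilon-greedy getAction could look like the following sketch, which assumes the exploration rate is stored in self.epsilon (as in the standard stub):

import random
import util

def getAction(self, state):
    actions = self.getLegalActions(state)
    if not actions:
        return None  # terminal state
    if util.flipCoin(self.epsilon):
        # Explore: pick uniformly among all legal actions;
        # this may happen to pick the current best action.
        return random.choice(actions)
    # Exploit: follow the current best Q-values.
    return self.computeActionFromQValues(state)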

To test your implementation, run the autograder:

python autograder.py -q q5

With no additional code, you should now be able to run a Q-learning crawler robot:

python crawler.py

This will invoke the crawling robot using your Q-learner. If this doesn't work, you've probably written some code too specific to the GridWorld problem and you should make it more general to all MDPs. Play around with the various learning parameters to see how they affect the agent's policies and actions.

Note that the step delay is a parameter of the simulation, whereas the learning rate and epsilon are parameters of your learning algorithm, and the discount factor is a property of the environment.