Follow the instructions in the comments for ValueIterationAgent, and read learningAgents.py. It contains the abstract class ValueEstimationAgent, which the class you are completing for Q1 subclasses, and it documents the expected return types and format for the methods you will implement in ValueIterationAgent, which may be helpful.
We recommend completing the methods in this order: computeQValueFromValues, computeActionFromValues, runValueIteration.
computeQValueFromValues(state, action) & computeActionFromValues(state)
Use the mdp methods in these functions. The expected types and structure of what they return are given in mdp.py. For example, self.mdp.getTransitionStatesAndProbs(state, action) will return a list of (next_state, probability) pairs for all the states reachable from state by taking action.
Don’t forget to return None if you are at a terminal state for the MDP.
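A minimal sketch of the two methods, assuming (per mdp.py) that the mdp object also provides getReward(state, action, nextState) and getPossibleActions(state), and that the agent already stores self.discount and self.values (a util.Counter) as in the starter code:

    def computeQValueFromValues(self, state, action):
        # Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + discount * V(s')]
        qValue = 0.0
        for nextState, prob in self.mdp.getTransitionStatesAndProbs(state, action):
            reward = self.mdp.getReward(state, action, nextState)
            qValue += prob * (reward + self.discount * self.values[nextState])
        return qValue

    def computeActionFromValues(self, state):
        actions = self.mdp.getPossibleActions(state)
        # Terminal states have no legal actions, so return None.
        if not actions:
            return None
        # Otherwise return the action with the highest Q-value under the current values.
        return max(actions, key=lambda a: self.computeQValueFromValues(state, a))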
runValueIteration()
The “batch” version of value iteration simply means we only update self.values after an iteration is done. This means you need some kind of temporary variable to hold your U_k values during an iteration: for example, a temporary util.Counter can be created at the start of each iteration, filled with the U_k values computed from U_{k-1}, and then, once the iteration finishes, copied into self.values using util.Counter's copy() method. A sketch of this loop follows.
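This sketch assumes, as above, that mdp.py provides getStates(), isTerminal(state), and getPossibleActions(state), and that self.iterations holds the iteration count from the starter code:

    def runValueIteration(self):
        for _ in range(self.iterations):
            # Temporary Counter: every update in this pass reads only U_{k-1} from self.values.
            newValues = util.Counter()
            for state in self.mdp.getStates():
                if self.mdp.isTerminal(state):
                    continue
                newValues[state] = max(self.computeQValueFromValues(state, action)
                                       for action in self.mdp.getPossibleActions(state))
            # The pass is done; commit U_k in one batch.
            self.values = newValues.copy()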
You must complete Q1 before doing Q2 and Q3.
You can run DiscountGrid similarly to BridgeGrid with:
python gridworld.py -a value -i 100 -g DiscountGrid --discount 0.9 --noise 0.2 --livingReward 0.0
Look at gridworld.py for additional arguments that can be used.
Again, check learningAgents.py for information on the expected return types and format for ReinforcementAgent methods. Note that ReinforcementAgent is a subclass of ValueEstimationAgent.
We recommend completing the methods in this order: __init__, getQValue, computeValueFromQValues, update.
__init__
In __init__, we recommend having an attribute to track the Q values of a QLearningAgent object.
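For example (the attribute name here is just an illustration), a util.Counter keyed on (state, action) pairs works well because unseen keys default to 0.0:

    def __init__(self, **args):
        ReinforcementAgent.__init__(self, **args)
        # Q-values; unseen (state, action) keys default to 0.0.
        self.qValues = util.Counter()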
computeValueFromQValues
Consider using self.getLegalActions(state) here; don't forget to return 0 for terminal states (states with no legal actions).
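A possible sketch, reading the stored Q-values through getQValue:

    def computeValueFromQValues(self, state):
        legalActions = self.getLegalActions(state)
        # Terminal state: no legal actions, so the value is 0.0.
        if not legalActions:
            return 0.0
        return max(self.getQValue(state, action) for action in legalActions)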
computeActionFromQValues
Remember that computeValueFromQValues returns the maximum Q-value; any action whose Q-value equals that maximum is a viable action, and you should break ties randomly among those actions. Don't forget to return None if you are in a terminal state.
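One way to do the tie breaking, assuming random is imported in your agent file:

    def computeActionFromQValues(self, state):
        legalActions = self.getLegalActions(state)
        # Terminal state: no legal actions, so return None.
        if not legalActions:
            return None
        bestValue = self.computeValueFromQValues(state)
        # Every action whose Q-value matches the max is viable; pick one at random.
        bestActions = [a for a in legalActions
                       if self.getQValue(state, a) == bestValue]
        return random.choice(bestActions)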
getAction(self, state)
You should return a random legal action with probability epsilon (see Q7 description), otherwise you should return the action given by computeActionFromQValues. Don’t forget to return None if you are in a terminal state.
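A sketch of the epsilon-greedy choice; this assumes self.epsilon is set by the ReinforcementAgent constructor and that util.flipCoin(p), which returns True with probability p, is available in the starter util.py:

    def getAction(self, state):
        legalActions = self.getLegalActions(state)
        # Terminal state: no legal actions, so return None.
        if not legalActions:
            return None
        # Explore with probability epsilon, otherwise exploit the learned Q-values.
        if util.flipCoin(self.epsilon):
            return random.choice(legalActions)
        return self.computeActionFromQValues(state)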
update
You will need to update your stored Q-values in this function so that later calls to getQValue see the new values.
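The standard Q-learning update as a sketch, assuming self.alpha and self.discount are set by the ReinforcementAgent constructor and qValues is the attribute from the __init__ example above:

    def update(self, state, action, nextState, reward):
        # Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * [r + discount * max_a' Q(s', a')]
        sample = reward + self.discount * self.computeValueFromQValues(nextState)
        self.qValues[(state, action)] = ((1 - self.alpha) * self.getQValue(state, action)
                                         + self.alpha * sample)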
After completing Q6, you can complete Q8.
This is already finished for you; it will use your Q6 and Q7 solutions. If you cannot pass Q9, there is an issue with your Q6 or Q8 solution.