GridWorld is a simple and commonly used environment in reinforcement learning (RL) to illustrate and test various RL algorithms. It is a grid-based environment where an agent navigates through a grid to achieve certain goals, potentially encountering rewards or penalties along the way.
A GridWorld with walls
A GridWorld with hazards (indicated by -1)
A GridWorld with different kinds of hazards (holes, bugs, crabs) and with rewards (flowers)
GridWorld typically consists of:
A grid of cells, where each cell can be empty, an obstacle, a start state, a goal state, or contain a reward or punishment.
The agent starts in a specific cell and must navigate to reach a goal cell.
States:
Each state corresponds to a unique position of the agent in the grid (e.g., the agent is at cell (2,3)).
The state space is finite and consists of all possible positions in the grid.
Actions:
The agent can move in four possible directions: up, down, left, or right.
Some actions might be invalid if they lead to a cell outside the grid or into an obstacle.
Rewards:
Rewards are assigned to specific states or actions. For example, reaching the goal state might provide a positive reward, while stepping into a hazardous state might provide a negative reward.
The reward function can be defined to incentivize certain behaviors, like reaching the goal quickly.
The objective of the agent is to maximize its cumulative reward over the duration of the task.
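To make the states/actions/rewards idea concrete, here is a tiny sketch of a GridWorld laid out as plain Python data. The layout, symbols, and reward values here are purely illustrative and are not the grids used in the exercise.

def _gridworld_sketch():
    # A 3x4 grid: '.' empty, '#' wall, 'S' start, 'G' goal (+1), 'H' hazard (-1).
    GRID = [
        ['.', '.', '.', 'G'],
        ['.', '#', '.', 'H'],
        ['S', '.', '.', '.'],
    ]

    ACTIONS = ['up', 'down', 'left', 'right']

    # States are (row, col) positions; the state space is every non-wall cell.
    STATES = [(r, c) for r in range(len(GRID))
              for c in range(len(GRID[0])) if GRID[r][c] != '#']

    def reward(state):
        # Reward for entering a state: +1 at the goal, -1 at the hazard, 0 elsewhere.
        cell = GRID[state[0]][state[1]]
        return {'G': 1.0, 'H': -1.0}.get(cell, 0.0)

    return STATES, ACTIONS, reward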
Download the ReinforcementLearning folder from [here].
In VS Code, use File > Open Folder to open the ReinforcementLearning folder.
You will implement your Q-learning agent in the qLearningAgents.py file in the Exercises folder.
There are 5 methods that you will need to implement for your Q-Learning agent:
This function will return the q-value of the given state-action pair, by looking it up in the q-table.
However, we are not going to actually use a table (or a 2D list) for this... it is easier and more efficient if we use dictionaries!
Notice that there is a self.qVals variable that is initialized to an empty dictionary. This is where you will hold the q-values.
You will set it up so that the keys in this dictionary are states. So, given some state, you can look up the q-values associated with that state by doing self.qVals[state].
Then, each state will have a q-value for every action, so we will set this up as another dictionary where you can look up the q-value for any action by its name: self.qVals[state][action].
In other words, when you do self.qVals[state] you will get something back that looks like this:
{ 'North': 2.34 , 'South': 1.12, 'East': 0.34, 'West': 0, 'Stop': 0.57, 'Exit': 0 }
IMPORTANT:
getQValue(state, action) should return the q-value for the given state-action pair, if this state is in the dictionary/qtable.
However, if this state is not in the dictionary yet, because this is the first time we have seen this state, then getQValue() should add this state to the dictionary/qtable, set the q-values for all six actions listed above to 0, and then return 0.
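As a rough sketch, getQValue() might look something like this, assuming the self.qVals dictionary and the six action names described above:

def getQValue(self, state, action):
    # First time we see this state: add it to the q-table with
    # a q-value of 0 for every action.
    if state not in self.qVals:
        self.qVals[state] = {'North': 0, 'South': 0, 'East': 0,
                             'West': 0, 'Stop': 0, 'Exit': 0}
    # Look up and return the q-value for this state-action pair.
    return self.qVals[state][action]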
These functions are where you will select the action for the agent to take, given the current state of the game. You will select the action that has the highest q-value.
You can first use self.getLegalActions(state) to get a list of which actions are legal for this state. For example, if the agent is up against a wall and 'West' is not a legal action from this state, then 'West' will not be in the list of legal actions.
Based on the current state and the legal actions, you will use getQValue() to get the q-values for each of the state-legalAction pairs. You then want to select the legalAction with the highest q-value.
In computeActionFromQValues() you will return the selected action.
In computeValueFromQValues() you will return the selected action's q-value.
Note: In computeActionFromQValues(), you can get better behavior if you break ties randomly. random.choice(list) will return a random choice from a list.
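Here is one possible sketch of these two methods, assuming self.getLegalActions() and getQValue() behave as described above. The handling of terminal states with no legal actions (returning None and 0.0) is an assumption, so check what the starter code expects.

import random  # at the top of qLearningAgents.py

def computeActionFromQValues(self, state):
    legalActions = self.getLegalActions(state)
    if not legalActions:          # terminal state: no action to take
        return None
    # Find the highest q-value among the legal actions, then break ties randomly.
    bestValue = max(self.getQValue(state, a) for a in legalActions)
    bestActions = [a for a in legalActions
                   if self.getQValue(state, a) == bestValue]
    return random.choice(bestActions)

def computeValueFromQValues(self, state):
    legalActions = self.getLegalActions(state)
    if not legalActions:          # terminal state: value is 0
        return 0.0
    return max(self.getQValue(state, a) for a in legalActions)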
The getAction() function will call self.computeActionFromQValues(), but this function is also where we add the epsilon-greedy strategy, which helps ensure the agent continues to explore rather than only exploit.
In this function, you will return a random action with probability self.epsilon, and you will return the action with the highest q-value the rest of the time. (You find the action with the highest q-value in computeActionFromQValues()).
How do you make something happen with a probability of epsilon?
You can use random.random() to generate a random number between 0-1. Then you can check whether the generated number is above or below epsilon!
How do you pick a random action?
You can use random.choice(list) to select a random item from list.
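Putting those two hints together, getAction() might look roughly like this; the early return of None for states with no legal actions is an assumption.

import random  # at the top of qLearningAgents.py

def getAction(self, state):
    legalActions = self.getLegalActions(state)
    if not legalActions:              # terminal state: nothing to do
        return None
    # With probability epsilon, explore by picking a random legal action;
    # otherwise exploit the current q-values.
    if random.random() < self.epsilon:
        return random.choice(legalActions)
    return self.computeActionFromQValues(state)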
Here is where you will update the q-values based on rewards and punishments received.
This is where we use the Bellman equation (the Q-learning update rule):
Q(st, at) ← Q(st, at) + alpha * [ rt+1 + gamma * maxa' Q(st+1, a') - Q(st, at) ]
Here's a breakdown of the terms in this equation:
Q(st, at) is the current Q-value for the state-action pair: look this up in your Qtable/dictionary
alpha is the learning rate: use self.alpha
rt+1 is the reward
gamma is the discount factor: use self.discount
maxa' Q(st+1, a') is the maximum Q-value for the nextState (over all possible actions for nextState): get this from your Qtable/dictionary
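Putting the pieces above together, a sketch of the update might look like this. The method name update and its parameter order are assumptions based on the description above, so match whatever signature the starter code provides.

def update(self, state, action, nextState, reward):
    # Current estimate Q(st, at); getQValue() adds the state to the
    # q-table if we have not seen it before.
    oldQ = self.getQValue(state, action)
    # maxa' Q(st+1, a'): best q-value available from the next state.
    nextValue = self.computeValueFromQValues(nextState)
    # Q-learning (Bellman) update, blended by the learning rate alpha.
    self.qVals[state][action] = oldQ + self.alpha * (
        reward + self.discount * nextValue - oldQ)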
Now it is time to run your agent in GridWorld and watch it learn!
The GridWorld we will start with looks like this:
In this GridWorld, the agent will start in the bottom left corner.
There are two end states, indicated in this picture by the green and red squares. The green end location gives a positive reward and the red end location gives a negative reward.
No other locations give rewards.
The gray square is a wall.
We want the agent to learn that the goal is to navigate from the bottom left to the upper right, without going onto the red square.
You can watch as your agent tries many different actions and learns this task.
As the agent moves, the display will show the Q-values for every state-action possibility, indicating the quality of each direction from each possible state.
As your agent learns more about the task, the cells along the best path will turn brighter green, indicating higher q-values along that path, while values that turn brighter red indicate actions the agent has learned are bad choices.
An example of an agent learning what path to take.
To run your agent in GridWorld use:
python3 gridworld.py -a q -k 100
The last number (100 in this case) will control the number of episodes your agent gets to learn.
If you want to try your agent on other GridWorlds, you can change the GridWorld by running these:
python3 gridworld.py -a q -k 100 -g MazeGrid
python3 gridworld.py -a q -k 100 -g CliffGrid
python3 gridworld.py -a q -k 100 -g BridgeGrid