Source: https://hub.gke2.mybinder.org
Environment: Cliffworld, a 4x12 grid. The agent starts at the green position.
At each time step the agent can move one step.
The agent cannot move off the map.
The game ends when
- the agent reaches the goal state (black), or
- the agent steps on one of the pink squares and falls off the cliff.
At each time step the agent receives a reward of -1. If it falls off the cliff, it receives a penalty of -100.
The simplest way to reach the destination is to repeatedly move to a new position, following some strategy, until the agent reaches the destination or falls into the cliff. In reinforcement learning, the strategy is to maximize the reward, and the way the reward is designed is interesting here. Since every reward is negative, the more steps the agent takes, the lower its total reward. To maximize the reward, it therefore has to reach the destination as quickly as possible.
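For example, a path that reaches the goal in 13 steps collects a total reward of $-13$, while a 20-step detour collects $-20$, so the shortest safe path is also the best-rewarded one.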
We use namedtuple to define all the states of the map. Before doing that we need to import the following packages:
import random
from collections import defaultdict, namedtuple
from itertools import product, starmap
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
#from IPython.display import Image, YouTubeVideo
from scipy import stats
Define states:
#define the environment based on namedtuple
# the following uses the namedtuple function to create the State class:
State = namedtuple('State', ['x', 'y', 'con'])  # x: row, y: col, con: condition
#this will contain all states
all_states = []
rows = 4
cols = 12
#I will create an environment where the start location is set to 2,
#the destination is set to 1,
#cliff squares are set to -1,
#and all others are set to 0. I use con to know the condition of a state.
for i in range(rows):
    for j in range(cols):
        if i == 0 and j == 0:
            #start state
            all_states.append(State(i, j, 2))
        elif i == 0 and j == 11:
            #destination state
            all_states.append(State(i, j, 1))
        elif i == 0 and j % 2 == 0:
            #cliff states
            all_states.append(State(i, j, -1))
        else:
            all_states.append(State(i, j, 0))
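As a quick sanity check (this snippet is not part of the original notebook), we can count how many states received each condition value:
from collections import Counter

# count states per condition value: 2 = start, 1 = destination, -1 = cliff, 0 = normal
# with the loop above this gives 1 start, 1 destination, 5 cliff and 41 normal squares
print(Counter(s.con for s in all_states))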
def draw_map(all_states):
    #data holds one numeric code per grid cell for the heatmap
    data = np.zeros((rows, cols))
    for state in all_states:
        #start
        if state.con == 2:
            data[state.x, state.y] = 4
        #destination
        elif state.con == 1:
            data[state.x, state.y] = 2
        #cliff
        elif state.con == -1:
            data[state.x, state.y] = 1
        else:
            data[state.x, state.y] = 0
    #states_colors is a colormap defined elsewhere in the notebook
    sns.heatmap(data, cmap=states_colors, cbar=False, square=True, linewidths=1, fmt='')
    plt.show()
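The colormap `states_colors` is not defined in this section; presumably it is set up elsewhere in the notebook. One possible definition (the colors are only a guess that matches the description above: green start, black goal, pink cliff), followed by a call to draw_map:
from matplotlib.colors import ListedColormap

# hypothetical colormap: data value 0 -> white, 1 (cliff) -> pink,
# 2 (destination) -> black, 3 (unused) -> grey, 4 (start) -> green
states_colors = ListedColormap(['white', 'deeppink', 'black', 'grey', 'green'])

draw_map(all_states)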
Define the movement steps:
#each action maps to a (row delta, column delta)
steps = {'<': (0, -1),
         '>': (0, 1),
         '^': (-1, 0),
         'v': (1, 0)}
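For illustration only (this snippet is not in the original notebook), moving right from the start position at (0, 0) just adds the offsets; get_next_state below applies the same arithmetic, while get_valid_actions keeps the agent on the map:
start = all_states[0]               # State(x=0, y=0, con=2)
dx, dy = steps['>']                 # move one column to the right
print((start.x + dx, start.y + dy)) # prints (0, 1)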
We create a class named CliffWorld as the environment. It needs the following methods (a minimal sketch follows the list):
- get_reward(state, action): how much reward we receive for taking an action in a state
- get_next_state(state, action): what the next state is, given the current state and the selected action
- get_valid_actions(state): which actions can be taken from a state
- log(): record every movement
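The CliffWorld class itself is not shown in this section, so the following is only a minimal sketch of what it might look like, pieced together from the method list above and from the attributes used later (start_state, record_list, reward_sum, draw_log). In particular, the reward scheme in get_reward and the output of draw_log are assumptions.
class CliffWorld:
    def __init__(self, start_state):
        self.start_state = start_state
        #index every state by its grid position for fast lookup
        self.state_by_pos = {(s.x, s.y): s for s in all_states}
        self.record_list = []  #(state, action, reward, next_state) tuples
        self.reward_sum = 0

    def get_valid_actions(self, state):
        #the agent cannot leave the map, so keep only in-bound moves
        valid = []
        for action, (dx, dy) in steps.items():
            if 0 <= state.x + dx < rows and 0 <= state.y + dy < cols:
                valid.append(action)
        return valid

    def get_next_state(self, state, action):
        dx, dy = steps[action]
        return self.state_by_pos[(state.x + dx, state.y + dy)]

    def get_reward(self, state, action):
        #one possible reward scheme, following the description above:
        #-100 for stepping into the cliff, -1 for any other move
        next_state = self.get_next_state(state, action)
        return -100 if next_state.con == -1 else -1

    def log(self, state, action, reward, next_state):
        self.record_list.append((state, action, reward, next_state))
        self.reward_sum += reward

    def draw_log(self):
        #print the visited positions and the accumulated reward
        path = [self.start_state] + [rec[3] for rec in self.record_list]
        print(' -> '.join(f'({s.x},{s.y})' for s in path))
        print('total reward:', self.reward_sum)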
class Agent:
    def __init__(self, env):
        self.env = env
The agent interacts with the environment: based on the current state it selects an action in order to move toward the destination. An agent can select actions according to different strategies. For example, the easiest strategy is to pick a random action at every step until the agent reaches the destination or falls into the cliff.
    #this method belongs to the Agent class above
    def select_random_action(self, state):
        valid_actions = self.env.get_valid_actions(state)
        return random.choice(valid_actions)
Now we run an episode with random actions until the agent falls into the cliff or reaches the destination.
def run_episode(agent):
    current_state = agent.env.start_state
    while not (current_state.con == -1 or current_state.con == 1):
        action = agent.select_random_action(current_state)
        next_state = agent.env.get_next_state(current_state, action)
        reward = next_state.con
        #log the transition before moving on, so current_state and next_state stay distinct
        agent.env.log(current_state, action, reward, next_state)
        current_state = next_state
    print(agent.env.record_list)
    agent.env.draw_log()
env = CliffWorld(all_states[0])
agent = Agent(env)
run_episode(agent)
Learning a model means using experience to estimate the actual mechanics of an environment. This makes it possible to plan, that is, to know the possible consequences of a decision before it is executed.
An important advantage of this approach is that it uses the available data/experience very efficiently. The biggest downside is that planning is, in general, computationally quite expensive.
Model-based approaches can be favourable for applications like robotics, where it is difficult to collect large amounts of data but the slowness of the robot's actions leaves enough time for the computations needed to plan a sequence of actions and their consequences.
The approach explored in the following is fundamentally different from using a model. Value-based learning means using experience to estimate a function which assigns a value to every possible state-action pair, telling the learner how good this pairing is in terms of maximizing the long-term return of rewards. Later models of biological learning (like the work by [Rescorla and Wagner](http://www.scholarpedia.org/article/Rescorla-Wagner_model)) are even closer to the value-based algorithms used today, but even classical conditioning, first discovered by Ivan Pavlov in his famous [dog experiment](https://en.wikipedia.org/wiki/Ivan_Pavlov), where a dog learns to relate a conditioned stimulus (the ringing of a bell) to a following unconditioned stimulus (food), is surprisingly similar to the Q-learning algorithm implemented in the following.
The goal is to find a sequence of actions which maximizes the expected return of rewards in the gridworld. So maybe it is enough to learn from experience how "good" it is to take some action in a particular state. If it is possible to learn a function which expresses this by assigning a value to every state-action pair, the agent can just take the highest-valued action in every state and eventually find the goal.
But how?
For every action the agent receives a reward. Sometimes a discount is applied, so that a reward on the next step is worth more than a reward in the distant future.
At time step $t$ the return $G_t$ is the sum of all future rewards from that time step on:
$$G_{t} = r_{t+1}+r_{t+2}+r_{t+3}+\dots +r_{n}$$
In the discounted case, rewards are multiplied by some $\gamma$ with $0 \le \gamma < 1$, raised to the power of the number of time steps between now and when the reward arrives. So if a reward is $k$ time steps away, it is discounted by $\gamma^{k-1}$:
$$ G_{t} = r_{t+1}+ \gamma r_{t+2}+ \gamma^2 r_{t+3}+ \dots = \sum_{k=0}^{\infty}\gamma^k r_{t+k+1} $$
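For example, with $\gamma = 0.9$ and a reward of $-1$ on each of the next three time steps (and nothing after that), the return is
$$G_{t} = -1 + 0.9\cdot(-1) + 0.9^2\cdot(-1) = -2.71$$
so penalties far in the future weigh less than immediate ones.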
Because the future is unknown, the return has to be expressed in a different way. The return can be understood as the sum of just two parts:
$$G_{t}= r_{t+1} + \gamma G_{t+1}$$
The return at time step $t$ is the immediate reward on the next time step, $r_{t+1}$, plus the discounted return from the next time step on.
This can be used to introduce a function, $Q(s,a)$, which assigns a value to every possible action in each state and tells the agent how good this action is in terms of maximizing the return $G_{t}$.
The Q function can be split in the same manner as the return $G_{t}$: the maximum value of taking some action $a_t$ in state $s_t$ is the sum of the reward for taking that action and the maximum (discounted) value of taking the optimal action on the next time step.
$$Q(s_t,a_t)= r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$
The agent has to learn $Q(s, a)$, a mapping from all state-action pairs to values that should be as close as possible to the *true* value of each pair. If a good estimate of the true Q-values is known, the agent just has to take the highest-valued action in every state.
This leads to the learning rule of the Q-learning algorithm (the use of capital letters indicates actual tabular values):
$$Q(S_{t},A_{t})\gets Q(S_{t},A_{t})+\alpha[R_{t+1}+\gamma \max_{A}Q(S_{t+1},A)-Q(S_{t},A_{t})]$$
The agent has some estimate of the value of taking action $A_t$ in state $S_t$. Now it executes this action and receives the reward $R_{t+1}$ in return. Because of the definition above, $Q(S_t,A_t)$ can be updated with the new information: the update moves the old estimate toward the immediate reward plus the discounted maximum estimate $\max_{A}Q(S_{t+1},A)$ of the state that was reached, minus the old estimate itself, scaled by some small step-size parameter $\alpha$.
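As a purely illustrative example of one update (the numbers are made up): with $\alpha=0.1$, $\gamma=0.9$, a current estimate $Q(S_t,A_t)=0$, a reward $R_{t+1}=-1$ and $\max_{A}Q(S_{t+1},A)=-2$, the new estimate becomes
$$Q(S_t,A_t) \gets 0 + 0.1\,[-1 + 0.9\cdot(-2) - 0] = -0.28$$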
class Agent:
    def __init__(self, env, alpha, epsilon, gamma):
        self.env = env
        self.epsilon = epsilon            #exploration rate
        self.alpha = alpha                #step size
        self.gamma = gamma                #discount factor
        self.Q = defaultdict(int)         #tabular Q-values, keyed by (state, action)
        self.A = defaultdict(set)
        self.td_list = []

    def get_epsilon_action(self, state):
        #epsilon-greedy: exploit the best known action with probability 1 - epsilon, otherwise explore
        if np.random.random() > self.epsilon:
            action = self.get_best_action(state)
        else:
            action = self.get_random_action(state)
        return action

    def learn(self, state, action, reward, next_state):
        temp_td = self.td(state, action, reward, next_state)
        self.Q[state, action] = self.Q[state, action] + self.alpha * temp_td

    def td(self, state, action, reward, next_state):
        #TD error: R + gamma * max_A Q(S', A) - Q(S, A), as in the update rule above
        max_action = self.get_best_action(next_state)
        temp_td = reward + self.gamma * self.Q[next_state, max_action] - self.Q[state, action]
        self.td_list.append(temp_td)
        return temp_td

    def get_best_action(self, state):
        #greedy action: the valid action with the highest Q-value in this state
        valid_actions = self.env.get_valid_actions(state)
        max_value = -float('inf')
        max_action = valid_actions[0]
        for action in valid_actions:
            if max_value < self.Q[state, action]:
                max_value = self.Q[state, action]
                max_action = action
        return max_action

    def get_random_action(self, state):
        valid_actions = self.env.get_valid_actions(state)
        return random.choice(valid_actions)

    def log(self, state, action, reward, next_state):
        self.env.record_list.append((state, action, reward, next_state))
        self.env.reward_sum += reward
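A training loop is not shown in this section. Purely as a sketch of how the pieces could fit together (the episode count, $\alpha$, $\epsilon$, $\gamma$ and the reward computation are assumptions, chosen to match the reward description at the top):
env = CliffWorld(all_states[0])
agent = Agent(env, alpha=0.1, epsilon=0.1, gamma=0.9)

for episode in range(500):
    state = env.start_state
    #an episode ends at the destination (con == 1) or in the cliff (con == -1)
    while state.con not in (1, -1):
        action = agent.get_epsilon_action(state)
        next_state = env.get_next_state(state, action)
        #-100 for falling into the cliff, -1 for every other step
        reward = -100 if next_state.con == -1 else -1
        agent.learn(state, action, reward, next_state)
        state = next_state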