I create a homemade environment and classes to play "hide and seek" with several reinforcement learning algorithms:
actor critic
policy gradient
deep reinforcement learning with double-Q learning (DDQN)
An operator is randomly placed in a room; the room contains a door to escape through, a bonus to catch, and a guard. The door, the bonus, and the guard are placed randomly as well.
The operator gets multiple rewards based on its actions:
-2 points every time it makes a move
an additional -1 if it tries to move outside the room
if the game lasts more than 100 actions, it gets -10 points and the game ends
+1 point if it hits the bonus
-10 points if it gets caught by the guard, and the game ends
+10 points if it finds the exit door, and the game ends
The door is fixed in position after placement, but at every turn the operator, the bonus, and the guard can each move in one direction.
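The reward scheme above can be sketched as a single function. This is a minimal sketch, not the actual environment code: the flag names (`action_valid`, `hit_bonus`, `caught`, `at_door`) are assumptions about how the environment reports events.

```python
MAX_STEPS = 100  # game ends after 100 actions, as described above

def step_reward(action_valid, steps, hit_bonus, caught, at_door):
    """Return (reward, done) for one move, following the rules above.
    All flag names are hypothetical; the real environment may differ."""
    reward, done = -2, False          # -2 points for every move
    if not action_valid:
        reward -= 1                   # extra -1 for trying to leave the room
    if hit_bonus:
        reward += 1                   # +1 for catching the bonus
    if caught:
        reward -= 10                  # caught by the guard: -10, game over
        done = True
    elif at_door:
        reward += 10                  # exit found: +10, game over
        done = True
    elif steps >= MAX_STEPS:
        reward -= 10                  # game too long: -10, game over
        done = True
    return reward, done
```

Note that the -2 move penalty pushes the operator to find the door quickly rather than wander.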
Sketch of the environment
The reward is then used to train the network in order to maximize the score.
Different algorithms use different approaches:
DDQN uses an exploration strategy, taking a random action every now and then so it never loses the chance to learn from an unexplored path
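The random-action scheme described here is commonly implemented as epsilon-greedy selection. A minimal sketch, assuming the network's output is a list of Q-values, one per action (the exact exploration schedule used in the project is not specified):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a random action (exploration),
    otherwise take the action with the highest Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In practice epsilon is often decayed over the course of training, so the agent explores a lot at first and exploits its learned policy later.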
the policy gradient and actor critic agents predict a probability distribution over the next actions, and the action taken is sampled from that distribution; this way there is always a chance of taking an unlikely action and exploring different solutions
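Sampling from the predicted distribution can be sketched as follows. This is a plain-Python illustration (frameworks like PyTorch offer `torch.distributions.Categorical` for the same purpose); `probs` is assumed to be the network's output, a list of action probabilities summing to 1:

```python
import random

def sample_action(probs):
    """Sample an action index from a probability distribution.
    Low-probability actions can still be picked, which is what
    gives these agents their built-in exploration."""
    r, cum = random.random(), 0.0
    for action, p in enumerate(probs):
        cum += p
        if r < cum:
            return action
    return len(probs) - 1   # guard against floating-point rounding
```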
All 3 methods reach a similar high score, beating the game. The operator decides to move toward the exit and leave the bonus behind if it is too close to the guard, or goes for the bonus when the risk of losing the game is low.
The network plays the game over and over again, learning to play it nearly perfectly.
Below is the score over 5000 games with the DDQN method; the improvement over time is unequivocal.
Score plot: score on the Y axis, games played on the X axis
The code can be found in my GitHub repository: