Deep Q Networks and Robotic Arms to play Connect Four
by Sam Zlota
Final Project for CS4910 Special Topics in Computer Science taught by David Klee
Code: https://github.com/sam-zlota/robo-connectfour
Action Space: There are 7 actions, one per column: [0, 1, 2, 3, 4, 5, 6]
State Space: The board is represented by a 6x7 matrix where each cell is either empty, occupied by player 1, or occupied by player 2. There are more than 4 trillion legal states
Reward: -1 for loss, 1 for win
NN Architecture: MLP with 2 hidden layers
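A minimal sketch of what this Q-network might look like in PyTorch, assuming the 3x6x7 observation is flattened before the MLP and using an illustrative hidden width of 128; the actual layer sizes and framework details in the repo may differ.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """MLP with 2 hidden layers: flattened 3x6x7 observation -> 7 Q-values."""
    def __init__(self, obs_shape=(3, 6, 7), num_actions=7, hidden=128):
        super().__init__()
        in_dim = obs_shape[0] * obs_shape[1] * obs_shape[2]
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one Q-value per column
        )

    def forward(self, obs):
        return self.net(obs)
```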
Observation Representation:
Using the raw board state directly is not ideal for training because it does not capture whose turn it is.
Observation:
3x6x7 tensor
First channel: a 6x7 binary matrix with a 1 wherever the first player occupies that spot
Second channel: a 6x7 binary matrix with a 1 wherever the second player occupies that spot
Third channel: a 6x7 matrix of all 1s or all -1s indicating whose turn it is (1 for the first player, -1 for the second player)
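A sketch of how a board could be converted into this observation, assuming the board is a 6x7 numpy array with 0 = empty, 1 = player 1, and 2 = player 2 (a hypothetical encoding; the repo's actual representation may differ).

```python
import numpy as np

def encode_observation(board, current_player):
    """board: 6x7 array with 0=empty, 1=player 1, 2=player 2.
    current_player: 1 or 2. Returns a 3x6x7 float32 observation."""
    obs = np.zeros((3, 6, 7), dtype=np.float32)
    obs[0] = (board == 1)                           # first player's pieces
    obs[1] = (board == 2)                           # second player's pieces
    obs[2] = 1.0 if current_player == 1 else -1.0   # whose turn it is
    return obs
```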
Training Details:
Exploration
In some experiments, the agent took a uniformly random move with a certain probability (epsilon-greedy exploration).
Demonstration
In other experiments, the agent instead took an expert move with a certain probability. This allowed for faster convergence. A sketch covering both selection schemes follows.
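A sketch of how both variants might select actions; `q_network`, `expert_move`, and `legal_actions` are assumed interfaces rather than names from the repo, and the greedy branch masks full columns so the chosen move is always legal.

```python
import random
import torch

def select_action(obs, q_network, epsilon, legal_actions, expert_move=None):
    """With probability epsilon take a random move (exploration) or, if
    expert_move is provided, an expert move (demonstration); otherwise greedy."""
    if random.random() < epsilon:
        if expert_move is not None:
            return expert_move(obs)          # demonstration variant
        return random.choice(legal_actions)  # exploration variant
    with torch.no_grad():
        q_values = q_network(torch.as_tensor(obs).unsqueeze(0)).squeeze(0)
    # mask out full columns so the greedy action is always legal
    mask = torch.full_like(q_values, float("-inf"))
    mask[legal_actions] = 0.0
    return int(torch.argmax(q_values + mask).item())
```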
Evaluation Metrics
Optimal play does not necessarily equate to a 100% win rate: Connect Four is a first-player win under perfect play, so an optimal player going first should always win, while an optimal player going second will sometimes lose. I therefore used metrics such as game length and draw rate to check whether the agent was learning defensive tactics; a sketch of this evaluation loop follows.
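A small sketch of how game length and draw rate could be tracked during evaluation; `play_game` is a hypothetical helper that plays one full game and reports the winner and move count.

```python
def evaluate(agent, opponent, play_game, num_games=100):
    """Report win/draw rates and average game length over num_games.
    play_game(agent, opponent) is assumed to return (winner, num_moves),
    where winner is "agent", "opponent", or None for a draw."""
    wins = draws = total_moves = 0
    for _ in range(num_games):
        winner, num_moves = play_game(agent, opponent)
        wins += winner == "agent"
        draws += winner is None
        total_moves += num_moves
    return {
        "win_rate": wins / num_games,
        "draw_rate": draws / num_games,
        "avg_game_length": total_moves / num_games,
    }
```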
Experiment 1: Baseline Results
250,000 steps
25,000 replay buffer size
Epsilon-greedy exploration, annealed from 0.99 to 0.5
Experiment 2: Effect of longer training
1,250,000 steps
125,000 replay buffer size
Epsilon-greedy exploration, annealed from 0.99 to 0.5
Experiment 3: Effect of Demonstration
1,250,000 steps
125,000 replay buffer size
Epsilon-greedy demonstration, annealed from 0.99 to 0.5
Experiment 4: Demonstration with Longer Training
1,700,000 steps (training crashed before completion)
150,000 replay buffer size
Epsilon-greedy demonstration, annealed from 0.95 to 0.7
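All four experiments anneal the exploration or demonstration probability over training (e.g., 0.99 to 0.5). A minimal sketch of such a schedule, assuming the decay is linear over the whole run; the actual schedule used may differ.

```python
def epsilon_schedule(step, total_steps, eps_start=0.99, eps_end=0.5):
    """Linearly anneal epsilon from eps_start to eps_end over total_steps."""
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# Example: Experiment 2's schedule at the halfway point of 1,250,000 steps
print(epsilon_schedule(625_000, 1_250_000))  # 0.745
```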
DQN Agent (YELLOW) goes first and plays Expert Agent (RED)
The DQN Agent's first move is the middle column, which is the optimal opening.
It also adopts the simple strategy of building a vertical tower right away. This may not be optimal, but it is still an intelligent strategy.
This agent was trained for 1.7 million steps with demonstrations from an expert agent.
Robotic Setup
Proof of concept:
Hard-code joint angles for each action (dropping a piece into a given column) from a fixed position relative to the board, as sketched below
Fixed grasp procedure
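A sketch of this proof-of-concept control flow; the joint-angle values and the `robot` interface (`move_to_joints`, `close_gripper`, `open_gripper`) are hypothetical stand-ins for the real robot API.

```python
# Hypothetical pre-recorded arm configurations (7 joint angles each),
# captured with the robot at a fixed position relative to the board.
PICKUP_JOINT_ANGLES = [0.00, -0.80, 0.20, -1.50, 0.00, 1.80, 0.79]
DROP_JOINT_ANGLES = {
    col: [0.10 * col - 0.30, -0.50, 0.30, -1.20, 0.00, 1.60, 0.80]
    for col in range(7)
}

def play_move(robot, column):
    """Fixed grasp, then move to the hard-coded pose for the column and release."""
    robot.move_to_joints(PICKUP_JOINT_ANGLES)       # fixed grasp procedure
    robot.close_gripper()
    robot.move_to_joints(DROP_JOINT_ANGLES[column]) # hard-coded drop pose
    robot.open_gripper()                            # piece falls into the column
```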
Future stages:
Incorporate camera to read game state and trigger moves
Incorporate camera to learn dynamic grasps
Initial Challenges
Obstacle avoidance
The robot needed to be elevated
Board columns were too thin