Backgammon and AI
By Tate Munnich
In this project I explored a few popular approaches to artificial intelligence play in backgammon. At any given point in a match, backgammon is a game of perfect information: both players can observe the entire state of the game. In other games of perfect information, such as chess or Go, computer programs have had notable success against human experts by looking ahead in the game tree to determine the strongest move. The dice in backgammon, however, add a much larger element of complexity: even the strongest computers are unable to look more than 3 or 4 moves ahead, so researchers needed a way to evaluate board positions more accurately. The question is: how can we evaluate the strength of a backgammon position? What is the probability of winning the game from this position, and which move will maximize that probability?
Check out the basic rules of backgammon if you're not familiar, or if you'd prefer to learn by doing, go here to get some reps in.
Heuristic functions are commonly used in artificial intelligence methods to evaluate search spaces. A common example is in pathfinding algorithms: say you are in a square maze, and you know that the exit is in the southeast corner. A rational strategy would be to generally head in the direction of the exit. There may be barriers in the way, but your intuition leads you in the right direction: this is analogous to using a distance heuristic.
Ideally, there would be a similar approach to playing backgammon. One very basic starting point is to count the number of spaces your pieces have to travel to eventually be taken off the board. These are called pips. In figure 1, black has five pieces with 4 more spaces to get off, four with 3, four with 2, and two with 1. So we would say that black has 5*4 + 4*3 + 4*2 + 2*1 = 42 pips. Similarly, red has 172. In this position black has a virtually insurmountable lead.
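To make this concrete, here's a minimal pip counter in Python. The dictionary board representation is my own assumption for illustration, not part of the encoding used later in this article:

```python
def pip_count(checkers):
    """Sum of the distances each checker must travel to bear off.

    `checkers` maps a point number (1-24, from the player's own
    perspective, so point 1 is closest to bearing off) to the number
    of that player's checkers sitting on it.
    """
    return sum(point * count for point, count in checkers.items())

# Black's position from figure 1: five checkers 4 pips from coming
# off, four checkers at 3 pips, four at 2, and two at 1.
black = {4: 5, 3: 4, 2: 4, 1: 2}
print(pip_count(black))  # 42
```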
Comparing pips can help determine which player is ahead in certain situations, but simply counting pips often fails to differentiate a good move from a bad one. Let's look at the opening roll 3-1, and consider the two moves black chose in figures 2a and 2b. I'll use the shorthand 8/5 6/5 to denote one piece moving from the 8th point to the 5th point and another moving from the 6th to the 5th.
8/5 6/5 in figure 2a is the better move, played by nearly 100% of experts. However, when we compare the pip counts for black, we'll see that they're equal (163 for those counting). If a player were making their moves solely based on the pip counts, they would have no preference between these two moves. In fact, for any move that black chooses, the resulting pip count will be 163, because the rules dictate that a player move the distances on both dice if they are able.
If we were trying to improve the instructions given to a player, how could we show them that the position in figure 2a is greatly superior to that in 2b? Is there a way to do this by counting only black's pips?
The answer is yes, by looking ahead an additional move. While black's pip count is the same after each of the moves, what is it likely to be after red moves? In position 2a, because black has left no lone pieces, it cannot be hit, so its pip count must remain at 163. But in position 2b, if red rolls a 2 or a 4, or two numbers that sum to 2 or 4, black gets hit and sent to the bar. In the worst case, both exposed pieces are hit and black's pip count jumps to 205 (a piece on the bar has a pip count of 25). Alternatively, red might hit only one of the exposed pieces, or might hit neither.
Figure 3 shows this visually in a game tree.
The expected value of each move can be calculated exactly using the probabilities of each scenario. For 8/5 6/5, because the probability that red hits nothing is 1, black's expected pips after red moves is simply 163. To calculate the expected pips for 6/5 6/3, we must examine which rolls can lead to which scenarios. We always assume that the opponent will make the move that is worst for us, so even though red could choose not to hit one of black's checkers, we assume it will whenever it can.
Now we can calculate black's expected pips after moving 6/5 6/3. For each scenario, we take the probability of that scenario occurring and multiply it by black's pips after that scenario, then add these products together to get the expected number of pips. The chance of rolling any particular double is 1/36, and any particular non-double is 2/36.
black's expected pips after 6/5 6/3 = (4/36)*205 + (8/36)*185 + (11/36)*183 + (13/36)*163 ≈ 178.7
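Here's the same calculation in a few lines of Python, using the scenario probabilities and pip counts worked out above (recall that a checker on the bar counts as 25 pips):

```python
# (probability, black's resulting pip count) for each scenario
# after red's reply to 6/5 6/3, as enumerated above.
scenarios = [
    (4 / 36, 205),   # both exposed checkers hit
    (8 / 36, 185),   # only the checker on the 3-point hit
    (11 / 36, 183),  # only the checker on the 5-point hit
    (13 / 36, 163),  # no hit
]
expected_pips = sum(p * pips for p, pips in scenarios)
print(round(expected_pips, 1))  # 178.7
```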
This strategy now gives us a reason to prefer moving 8/5 6/5 over 6/5 6/3. In artificial intelligence speak, this technique is called 2-ply expectiminimax: the "2-ply" refers to looking ahead two half-moves. If you're looking for a more formal definition of the algorithm, click here.
Technically, there is nothing stopping us from running 3-, 4-, or even 5-ply expectiminimax to help evaluate a position. However, the search space grows exponentially, so a computer might have to run for hours on end to evaluate a single position at 5-ply.
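For the curious, here is a rough Python sketch of the recursion. The helpers legal_moves, apply_move, other, and evaluate are hypothetical stand-ins for a real backgammon engine and are not spelled out here; only the roll distribution is concrete:

```python
# The 21 distinct rolls with their probabilities: each double has
# probability 1/36, each non-double 2/36 (it can come up two ways).
ALL_ROLLS = [((d1, d2), (1 if d1 == d2 else 2) / 36)
             for d1 in range(1, 7) for d2 in range(d1, 7)]

def expectiminimax(board, roll, player, depth, maximizing):
    """Expected value of `board` for the root player, given that
    `player` must now play `roll`, searching `depth` half-moves."""
    if depth == 0:
        return evaluate(board)  # heuristic leaf value, e.g. pip-based
    choose = max if maximizing else min  # opponent picks our worst case
    best = None
    for move in legal_moves(board, player, roll):   # assumed helper
        child = apply_move(board, player, move)     # assumed helper
        # Chance node: the next roll is unknown, so average the value
        # of the next player's turn over all 21 distinct rolls.
        value = sum(prob * expectiminimax(child, next_roll, other(player),
                                          depth - 1, not maximizing)
                    for next_roll, prob in ALL_ROLLS)
        best = value if best is None else choose(best, value)
    return best
```

Each half-move multiplies the tree by every legal move and every possible roll, which is the exponential blowup described above.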
Ideally we would come up with a way to calculate a player's chance of winning given a certain position, because then all one would have to do is to choose the move which maximizes their chance of winning, without having to worry about exponentially expanding game trees.
In 1992, IBM researcher Gerald Tesauro set out to create a backgammon program that could accurately and reliably estimate the probability that a player would win a game given the board's position. He used a technique called temporal difference (TD) learning. The basic learning approach works as follows.
Two computers play many games against each other using the same tunable evaluation function F. The function F takes a board as its input and outputs the player's probability of winning from that board. On each turn, starting from the current board b, the computer follows the three steps below.
1. The computer evaluates the strength of each of its candidate moves with F, choosing the move m that leads to the board b' maximizing the output o' = F(b').
2. The computer plays m, resulting in the new board b'.
3. The computer then adjusts its evaluation function. If b' ends the game in a winning position, it adjusts F so that F(b) moves towards 1; if b' is a losing position, towards 0. Otherwise, it moves F(b) towards o'.
The ultimate goal is for the function F to converge to a function that calculates the exact chances of a player winning given an input board b.
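That update rule is only a few lines in code. Here's a hedged sketch in Python, where model stands in for a tunable function approximator; the names td_step, predict, update_towards, and alpha are my own hypothetical interface, not Tesauro's:

```python
def td_step(model, b, b_next, game_over, won, alpha=0.1):
    """One temporal-difference update after a move from board b to b_next."""
    if game_over:
        # Terminal position: the true outcome is known, so the
        # target for F(b) is exactly 1 (win) or 0 (loss).
        target = 1.0 if won else 0.0
    else:
        # Non-terminal: move F(b) toward the model's own estimate
        # of the next position, o' = F(b').
        target = model.predict(b_next)
    error = target - model.predict(b)       # the "temporal difference"
    model.update_towards(b, alpha * error)  # nudge F(b) toward the target
```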
My first step in implementing the TD algorithm was to choose a way to represent the board as data the computer can interpret. For this, I turned to one of Tesauro's original encoding schemes, as outlined in Glen Matthews's thesis on TD-learning in nondeterministic board games. Any given board is encoded with 198 input units: 1 unit encodes the number of pieces white has taken off the board, 96 units encode white's pieces on the board, 1 unit encodes white's pieces on the bar, 2 units encode whose turn it is, and a similar 98 (1 + 96 + 1) units encode black's pieces.
Ideally, the numbers in the input vector fall in the [0, 1] range, so there's also some scaling involved. The 96 units encoding a single player's pieces work as follows. There are 4 input units for each point on the board. If a player has 0 pieces on a given point, the encoding for that point is (0, 0, 0, 0). If there are pieces on the point, they are encoded in a unary-like manner: 1, 2, and 3 pieces on a point are encoded as (0, 0, 0, 1), (0, 0, 1, 1), and (0, 1, 1, 1), respectively. When a player has n > 3 pieces on a given point, this is encoded as ((n-3)/2, 1, 1, 1).
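A quick Python sketch of those 4 units per point, matching the scheme just described:

```python
def encode_point(n):
    """Encode the number of one player's checkers on a single point
    as 4 input units, using the truncated-unary scheme above."""
    units = [0.0, 0.0, 0.0, 0.0]
    if n >= 1:
        units[3] = 1.0
    if n >= 2:
        units[2] = 1.0
    if n >= 3:
        units[1] = 1.0
    if n > 3:
        units[0] = (n - 3) / 2  # overflow unit, scaled toward [0, 1]
    return units

print(encode_point(0))  # [0.0, 0.0, 0.0, 0.0]
print(encode_point(2))  # [0.0, 0.0, 1.0, 1.0]
print(encode_point(5))  # [1.0, 1.0, 1.0, 1.0]
```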
After the input vector is generated for a board, the evaluation function is computed by a feedforward, fully connected neural network with 40 hidden nodes. The output of a node is calculated by taking the sigmoid function S(x) = 1/(1+e^(-x)) of the weighted sum of its inputs.
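Here's a minimal sketch of that forward pass using NumPy. The 198-by-40 and 40-by-1 weight shapes follow the encoding above; the random initialization is just for illustration, and bias terms are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, W2):
    """Evaluate a 198-unit board encoding x with one 40-node hidden
    layer; the single sigmoid output estimates P(win)."""
    hidden = sigmoid(x @ W1)      # 198 inputs -> 40 hidden activations
    return sigmoid(hidden @ W2)   # 40 hidden -> 1 output probability

# Untrained example: random weights applied to a random encoding.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(198, 40))
W2 = rng.normal(scale=0.1, size=(40, 1))
x = rng.random(198)
print(forward(x, W1, W2))  # a single value in (0, 1)
```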
I attempted to replicate Tesauro's work using a simple version of a temporal difference learning algorithm. To evaluate the strength of my player, I used two benchmark players: GNU Backgammon, a very strong open-source program whose neural network was trained with techniques similar to temporal difference learning, and TatePlayer, a slightly modified 2-ply expectiminimax player using a pip-ratio heuristic.
Below is a table showing the optimal opening moves for each of the 15 distinct initial rolls (the game cannot start on doubles). The 1-ply neural network identified five of the optimal moves, the 2-ply network identified six, and TatePlayer found five.
TatePlayer bases its decisions on the ratio of the opponent's pips to its own, and is therefore using externally supplied knowledge. The neural network, by contrast, is given no external information; it makes its decisions knowing only the rules of the game.
Although this opening move analysis was not incredibly successful, both the neural network and TatePlayer make for enjoyable, intermediate-level opponents. With more time and tuning of the neural network's training parameters, one would expect a stronger player to emerge.