Monte Carlo Tree Search

Adversarial search (especially in the game of Go) has been dominated in recent years by Monte Carlo Tree Search (MCTS). This algorithm works by simulating many random games, then choosing the move that most often led to a win in these random games. The choice of moves within each random game is biased by the results of previous random games: more promising moves are tried more often.

MCTS differs from classical (minimax) search in a number of ways:

    • Randomness is involved.

    • Some branches of the game tree are explored more deeply than others.

    • The search tree is an actual data structure; in minimax search, the "tree" is merely a conceptual representation of the sequence of function/method calls.

    • MCTS can stop at any time and report the best answer found so far. More time means more simulations and better moves. (Minimax can have this "just in time" property if it is implemented with iterative deepening, but the granularity is finer with MCTS).

    • There is no need for a static evaluation function.

This page describes the basic MCTS algorithm. Cutting-edge implementations add many refinements, such as efficiency improvements and game-specific knowledge.

The Search Tree

Each node in the tree corresponds to one state of the game. In a simple implementation, there is no need to store the state itself; instead, each child is associated with the move that leads to it. (In Java, this can be done with a java.util.HashMap mapping moves to child nodes.)
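
For concreteness, here is one possible shape for such a node in Java. This is only a minimal sketch; Move is a placeholder for whatever move representation the game uses.

    import java.util.HashMap;
    import java.util.Map;

    // Placeholder move type (hypothetical); in Go this might simply be a board point.
    record Move(int point) {}

    // A minimal tree node: win/playout counts plus a map from moves to child nodes.
    class Node {
        double wins;       // wins, counted for the player who has just played at this node
        double playouts;   // number of simulated games that have passed through this node
        final Map<Move, Node> children = new HashMap<>();
    }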

A tree partway through a game is shown at right. The color of each node indicates the player who has just played. (It's easy to get confused about this!) Thus, at the root, white has just played and it is black's turn.

Within each node are two numbers, expressed as a ratio. The first is the number of wins (in simulated games through this node) for the player who just played. The second is the number of games that have been simulated through this node.

(The detail-oriented reader will notice that the root is special in a couple of ways. First, the number of wins is irrelevant because the real game is already at this position. Second, the number of simulated games is equal to the sum of the number of simulated games through the root's children. Other nodes have one "extra" simulated game because that game was played before the children existed. These details can be ignored.)

In this example, it is black's turn. Black prefers the move on the right because that move is most likely to result in a win for black.

Playouts

Each simulated game is called a playout. Many (e.g., thousands of) playouts are performed before an MCTS player actually chooses its move.

A playout consists of the following steps:

    1. Travel down the tree from the root, choosing moves based on the information stored in the tree.

    2. If the simulated game is not over, play random moves to the end of the game.

    3. Update the numbers in the tree nodes that were visited in step 1. Also, add a new leaf as a child of the last such node (unless the game was over).
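
To make the structure of a playout concrete, here is a rough Java sketch. It assumes the Node and Move types sketched above plus a hypothetical Board interface (any game offering these operations would do); selectMove and updateTree are filled in under the next two headings.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical game interface; any game offering these operations would do.
    interface Board {
        boolean isOver();
        void play(Move move);        // play a specific move
        void playRandomMove();       // play a random legal move
        boolean blackWon();          // result of a finished game (assumes no draws)
    }

    abstract class MctsPlayer {
        // One playout on a scratch copy of the board, starting at the root of the tree.
        void playout(Node root, Board board, boolean blackJustPlayedAtRoot) {
            // Step 1: travel down the tree, choosing moves from the stored statistics.
            List<Node> visited = new ArrayList<>();
            Node node = root;
            visited.add(node);
            Move offTreeMove = null;                  // first move with no node in the tree yet
            while (!board.isOver()) {
                Move move = selectMove(node, board);  // an untried move, or the best UCB1-TUNED value
                board.play(move);
                Node child = node.children.get(move);
                if (child == null) {                  // we have stepped off the tree
                    offTreeMove = move;
                    break;
                }
                node = child;
                visited.add(node);
            }
            // Step 2: if the simulated game is not over, play random moves to the end.
            while (!board.isOver()) {
                board.playRandomMove();
            }
            // Step 3: update the visited nodes and add a new leaf (see "Updating the Tree").
            boolean winnerJustPlayedAtRoot = (board.blackWon() == blackJustPlayedAtRoot);
            updateTree(visited, node, offTreeMove, winnerJustPlayedAtRoot);
        }

        abstract Move selectMove(Node node, Board board);
        abstract void updateTree(List<Node> visited, Node lastVisited,
                                 Move offTreeMove, boolean winnerJustPlayedAtRoot);
    }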

Selecting Moves Within the Tree

Within the tree, moves leading to more wins should be chosen more often. This causes the tree to grow more deeply down more promising branches, so the algorithm does not waste its time exploring the consequences of "obviously bad" moves.

A simple approach would be to choose the child with the highest win rate. Unfortunately, there are two complications.

First, the win rate for a move that hasn't been tried before is 0/0 — not a number. This is easily remedied by always preferring an untried move until all untried moves have been tried once. It is vital that the untried moves be tried in a random order.
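
Continuing the MctsPlayer sketch above, one way to do this is to shuffle the legal moves and return the first one with no child in the tree. (This assumes the board can supply its legal moves as a list; that method is not part of the sketched Board interface.)

    import java.util.Collections;
    import java.util.List;

    // Return an untried move, or null if every legal move has been tried from this node.
    // Assumes the board can supply its legal moves as a List<Move> (hypothetical).
    Move untriedMove(Node node, List<Move> legalMoves) {
        Collections.shuffle(legalMoves);              // vital: try untried moves in random order
        for (Move move : legalMoves) {
            if (!node.children.containsKey(move)) {
                return move;                          // never tried from this node
            }
        }
        return null;                                  // all moves tried; fall back to UCB1-TUNED
    }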

Second, an excellent move might be abandoned due to bad luck. Even if a move would win 99 out of 100 playouts, if it happened to lose the first one, it would have a terrible win rate of 0/1 and never be tried again. Solving this problem requires a way to balance exploration (trying undersampled moves that might be good) with exploitation (using the information gathered so far about which moves are best). An algorithm that did too much exploration would waste its time on bad moves. An algorithm that did too much exploitation would sometimes miss good moves.

Researchers have found that a good compromise is the UCB1-TUNED formula.

(The name UCB stands for Upper Confidence Bounds; the formula chooses the move with the highest upper bound on its confidence interval. A move could have a high upper bound either because it has fared well in past playouts or because, having not experienced many playouts, it has a wide confidence interval. Details can be found in Gelly et al.'s paper Modification of UCT with Patterns in Monte-Carlo Go.)

The formula chooses the move i that maximizes the function

ri + sqrt(((ln P) / pi) * min(0.25, ri - ri^2 + sqrt(2 * (ln P) / pi)))

where ri is the win rate (wins over playouts) for move i, P is the total number of playouts through the parent, and pi is the number of playouts through move i.
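
In Java, the value for one move might be computed like this (a sketch; it assumes the move's child has at least one playout, which the untried-move rule above guarantees). Note that the child's win count is already from the perspective of the player choosing at the parent, since it counts wins for the player who has just played at the child.

    // UCB1-TUNED value for one move: win rate plus an exploration term.
    // wins and playouts are the child's counts; parentPlayouts is P in the formula.
    double ucb1Tuned(double wins, double playouts, double parentPlayouts) {
        double winRate = wins / playouts;                       // ri
        double logTerm = Math.log(parentPlayouts) / playouts;   // (ln P) / pi
        double variance = winRate - winRate * winRate           // sample variance of a win/loss outcome...
                        + Math.sqrt(2 * logTerm);               // ...plus its own confidence term
        return winRate + Math.sqrt(logTerm * Math.min(0.25, variance));
    }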

Moves are chosen using this formula until a leaf is reached. Now the remaining moves are chosen randomly.
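
Putting the two rules together, a sketch of selectMove for the MctsPlayer above (again assuming a hypothetical legalMoves() method on the board):

    import java.util.Map;

    // Choose a move at this node: any untried move first (in random order),
    // otherwise the move whose child has the highest UCB1-TUNED value.
    Move selectMove(Node node, Board board) {
        Move untried = untriedMove(node, board.legalMoves());   // legalMoves() is an assumed addition
        if (untried != null) {
            return untried;
        }
        Move best = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Move, Node> entry : node.children.entrySet()) {
            Node child = entry.getValue();
            double value = ucb1Tuned(child.wins, child.playouts, node.playouts);
            if (value > bestValue) {
                bestValue = value;
                best = entry.getKey();
            }
        }
        return best;
    }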

Updating the Tree

After the playout is completed, the tree is updated. Each node on the path from the root to the last node visited gets one more playout and (maybe) one more win, depending on the result of the playout and whose turn it is to play at that node. Note that whether a win is added will alternate from node to node as the tree is descended.

Finally, a new leaf is added as a child of the last node visited. This gets one playout and maybe one win.

The diagram at right shows the result of updating the previous tree after a playout in which the highlighted nodes were visited and black won. The new leaf is connected by a dashed line. (This sequence of moves was chosen arbitrarily; I did not verify that these are the moves within the tree that UCB1-TUNED would have chosen.)
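
Continuing the Java sketch, the update might look like this; winnerJustPlayedAtRoot records whether the playout was won by the player who had just played at the root (the root's color in the diagrams).

    import java.util.List;

    // Add one playout (and maybe one win) to each node on the path, then add a new leaf.
    void updateTree(List<Node> visited, Node lastVisited,
                    Move offTreeMove, boolean winnerJustPlayedAtRoot) {
        boolean winnerJustPlayed = winnerJustPlayedAtRoot;
        for (Node node : visited) {
            node.playouts++;
            if (winnerJustPlayed) {
                node.wins++;
            }
            winnerJustPlayed = !winnerJustPlayed;     // the win credit alternates down the path
        }
        if (offTreeMove != null) {                    // no new leaf if the game ended inside the tree
            Node leaf = new Node();                   // the new leaf: one playout, maybe one win
            leaf.playouts = 1;
            if (winnerJustPlayed) {                   // the parity continues one level deeper
                leaf.wins = 1;
            }
            lastVisited.children.put(offTreeMove, leaf);
        }
    }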

The Full MCTS Algorithm

With those pieces in place, the full MCTS algorithm is simple:

    Initialize the tree to consist of a single node with no playouts
    For a certain number of playouts (or until time runs out):
        Run a playout on a copy of the current board state
    Choose the move leading to the child with the most wins

We choose the child with the most wins, rather than the highest win rate, so that win rate ties are broken in favor of more thoroughly explored nodes. Otherwise, we might prefer a 1/1 node over a 99/100 node. That's fine during the playouts, but when the playouts are over, it's too late for exploration; this is the time to exploit our hard-earned data!
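
A sketch of the top-level method, using the pieces above (board.copy() is another assumed operation, used to run each playout on a scratch copy):

    import java.util.Map;

    // Run many playouts, then choose the move leading to the child with the most wins.
    Move bestMove(Board board, boolean blackJustPlayed, int numberOfPlayouts) {
        Node root = new Node();                       // a single node with no playouts
        for (int i = 0; i < numberOfPlayouts; i++) {
            playout(root, board.copy(), blackJustPlayed);   // copy() is an assumed Board method
        }
        Move best = null;
        double mostWins = -1;
        for (Map.Entry<Move, Node> entry : root.children.entrySet()) {
            if (entry.getValue().wins > mostWins) {   // most wins, not highest win rate
                mostWins = entry.getValue().wins;
                best = entry.getKey();
            }
        }
        return best;
    }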

Further Reading

An excellent survey of the literature can be found in Browne et al., A Survey of Monte Carlo Tree Search Methods.