JiangJun: Mastering Xiangqi by Tackling Non-Transitivity in Two-Player Zero-Sum Games
Yang Li₁*, Kun Xiong₂, Yingping Zhang₂, Jiangcheng Zhu₂, Stephen Mcaleer₃, Wei Pan₁, Jun Wang₄,
Zonghong Dai₂†, Yaodong Yang₅†
1: The University of Manchester 2: Huawei 3: Carnegie Mellon University 4: University College London 5: Peking University
*: Part of this work was done while Yang Li was an intern at Huawei
†: Corresponding authors: daizonghong@huawei.com; yaodong.yang@kcl.ac.uk
Abstract
This paper presents an empirical exploration of non-transitivity in perfect-information games, focusing on Xiangqi, a traditional Chinese board game comparable in game-tree complexity to chess and shogi. By analyzing over 10,000 records of human Xiangqi play, we show that the game’s strategic structure contains both transitive and non-transitive components. To address non-transitivity, we introduce the JiangJun algorithm, a novel combination of Monte-Carlo Tree Search (MCTS) and Policy Space Response Oracles (PSRO) designed to approximate a Nash equilibrium. We evaluate the algorithm empirically through a WeChat mini-program, where it achieves Master level with a 99.41% win rate against human players. The algorithm’s effectiveness in overcoming non-transitivity is confirmed by a range of metrics, including relative population performance and visualization results.
Part 1 : Geometrical Landscape Analysis of Xiangqi
To analyze Xiangqi’s geometry thoroughly, we collected a dataset of over 10,000 real-world Xiangqi game records from the PlayOK game platform. Our analysis supports the spinning-top game geometry hypothesis and exposes the non-transitivity present in Xiangqi.
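Non-transitivity in such a dataset can be made concrete on an empirical win-rate matrix. As a minimal sketch (not the paper’s actual analysis pipeline; the helper name `count_3cycles` is hypothetical), the code below counts rock-paper-scissors-style cycles among strategies in an antisymmetric payoff matrix. Real Xiangqi play mixes such cycles with a transitive skill axis, which is what produces the spinning-top shape.

```python
import numpy as np
from itertools import combinations

def count_3cycles(payoff):
    """Count strategy triples (i, j, k) where i beats j, j beats k, k beats i.

    `payoff` is antisymmetric: payoff[i][j] > 0 means strategy i beats
    strategy j on average. A perfectly transitive game has zero such cycles.
    """
    cycles = 0
    for i, j, k in combinations(range(len(payoff)), 3):
        # each unordered triple admits two possible cyclic orientations
        for a, b, c in ((i, j, k), (i, k, j)):
            if payoff[a][b] > 0 and payoff[b][c] > 0 and payoff[c][a] > 0:
                cycles += 1
    return cycles

# Rock-paper-scissors is maximally non-transitive ...
rps = np.array([[0.,  1., -1.],
                [-1., 0.,  1.],
                [1., -1.,  0.]])
# ... while a game fully ordered by strength has no cycles.
transitive = np.array([[0., -1., -1.],
                       [1.,  0., -1.],
                       [1.,  1.,  0.]])
print(count_3cycles(rps))         # -> 1
print(count_3cycles(transitive))  # -> 0
```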
Part 2 : JiangJun
To address this non-transitivity, we propose the JiangJun algorithm. As depicted in Figure 4, JiangJun comprises two primary modules: the MCTS Actor and the Populationer. The MCTS Actor generates training data, while the Populationer maintains a population of diverse policies, an inference module, and a maximum-entropy Nash solver.
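The PSRO loop at the heart of this design can be sketched in a few lines. This is a simplified illustration, not the paper’s implementation: fictitious play stands in for the maximum-entropy Nash solver, and the best-response `oracle` is an abstract callable (in JiangJun, that role is played by the MCTS Actor training a network against the Nash mixture).

```python
import numpy as np

def approx_nash(meta_payoff, iters=2000):
    """Approximate a Nash of the zero-sum meta-game via fictitious play
    (a simple stand-in for JiangJun's maximum-entropy Nash solver)."""
    counts = np.ones(len(meta_payoff))
    for _ in range(iters):
        avg = counts / counts.sum()
        counts[np.argmax(meta_payoff @ avg)] += 1  # best response to the mix
    return counts / counts.sum()

def psro_step(population, meta_payoff, oracle):
    """One PSRO iteration: solve the meta-game over the current population,
    then grow the population with the oracle's approximate best response."""
    sigma = approx_nash(meta_payoff)
    return population + [oracle(population, sigma)], sigma
```

On a transitive meta-game the solver concentrates mass on the dominant policy, while on a cyclic meta-game (e.g. rock-paper-scissors) it spreads mass across the cycle; each `psro_step` then extends the population by one new policy.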
Training of the JiangJun algorithm was performed on the Huawei Cloud ModelArts platform. The full resource pool comprises 90 NVIDIA V100 GPUs (32 GB each) and over 3,000 CPU cores.
Part 3: WeChat mini-program and human experiment
To evaluate the effectiveness of the JiangJun algorithm comprehensively, we deployed a JiangJun mini-program on the WeChat platform. Over six months of deployment, the mini-program accumulated 7061 game records against human players (1924 games during the training phase and 5119 during evaluation).
During the “Training” phase, covering the first three months of deployment, the JiangJun agent averaged a 98.16% win rate, demonstrating considerable strength even while still learning and adapting. In the subsequent three-month “Evaluation” stage, the mini-program recorded roughly 1703 games per month, and JiangJun sustained a remarkable average win rate of 99.41%.
Part 4: Case Study: JiangJun efficiently overcomes non-transitivity
In an ideally transitive game, the gamescape will appear as a line in the 2D embedding figure, while a non-transitive game will form a cycle (Balduzzi et al., 2019b).
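This line-versus-cycle picture can be quantified directly from a payoff matrix. The sketch below (hypothetical helper names, in the spirit of the gamescape construction of Balduzzi et al. rather than their exact method) embeds each strategy as a 2D point from the antisymmetric part of the payoff matrix; the second singular value of the centered rows is near zero when the points are collinear (transitive) and positive when they curl into a cycle.

```python
import numpy as np

def embed_2d(payoff):
    """Embed each strategy as a 2D point from the antisymmetric payoff part."""
    A = 0.5 * (payoff - payoff.T)      # antisymmetrize the payoff matrix
    rows = A - A.mean(axis=0)          # center the strategy rows
    _, _, Vt = np.linalg.svd(rows, full_matrices=False)
    return rows @ Vt[:2].T             # project onto the top-2 directions

def cycle_score(payoff):
    """Second singular value of the centered rows: ~0 when embedded
    strategies are collinear (transitive game), positive when they
    form a cycle (non-transitive game)."""
    A = 0.5 * (payoff - payoff.T)
    rows = A - A.mean(axis=0)
    s = np.linalg.svd(rows, compute_uv=False)
    return float(s[1]) if len(s) > 1 else 0.0

# Rating-driven game: payoff[i][j] = rating_i - rating_j, rows collinear.
ratings = np.array([0., 1., 2.])
elo_game = ratings[:, None] - ratings[None, :]
# Rock-paper-scissors: strategies form a cycle.
rps = np.array([[0., 1., -1.], [-1., 0., 1.], [1., -1., 0.]])
```

Under this measure, a population whose `cycle_score` shrinks over training is collapsing toward a transitive ordering, which is the behavior reported for JiangJun below.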
As illustrated in Figure 7, the second-order regression fit is almost a straight line across the three different training steps, suggesting that the JiangJun agent’s training is approaching transitivity. This indicates that our approach effectively overcomes non-transitivity, ensuring consistent performance improvement across the population.
Part 5: Trajectories of JiangJun playing endgames
Notably, JiangJun adeptly navigates the complexities of the Xiangqi endgame: the final phase of a game, characterized by few remaining pieces on the board and requiring precise maneuvering to secure a checkmate or a draw.