We introduce PokéChamp, a minimax agent powered by Large Language Models (LLMs) for Pokémon battles. Built on a general framework for two-player competitive games, PokéChamp leverages the generalist capabilities of LLMs to enhance minimax tree search. Specifically, LLMs replace three key modules: (1) player action sampling, (2) opponent modeling, and (3) value function estimation, enabling the agent to effectively utilize gameplay history and human knowledge to reduce the search space and address partial observability. Notably, our framework requires no additional LLM training. We evaluate PokéChamp in the popular Gen 9 OU format. When powered by GPT-4o, it achieves a win rate of 76% against the best existing LLM-based bot and 84% against the strongest rule-based bot, demonstrating its superior performance. Even with an open-source 8-billion-parameter Llama 3.1 model, PokéChamp consistently outperforms the previous best LLM-based bot, PokéLLMon powered by GPT-4o, with a 64% win rate. PokéChamp attains a projected Elo of 1300-1500 on the Pokémon Showdown online ladder, placing it among the top 30%-10% of human players. In addition, this work compiles the largest real-player Pokémon battle dataset, featuring over 3 million games, including more than 500k high-Elo matches. Based on this dataset, we establish a series of battle benchmarks and puzzles to evaluate specific battling skills. We further provide key updates to the local game engine. We hope this work fosters further research that leverages Pokémon battles as a benchmark for integrating LLM technologies with game-theoretic algorithms to address general multi-agent problems.
PokéChamp uses an LLM in combination with minimax tree search to: (1) sample player actions, proposing likely, diverse candidate actions for potential strategies, (2) model the opponent based on their move history, team, and strategy, conditioned on their skill level, and (3) evaluate internally planned game trajectories without requiring a terminal win/lose state. The diagram above visualizes this planning process for a single turn: the agent first samples a move for itself and a likely move for its opponent, then simulates both in the world model. This process repeats until the planning horizon is reached, at which point the resulting states are evaluated and assigned values by the LLM. Finally, the agent selects the action that leads to the branch with the highest value.
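To make the loop concrete, here is a minimal Python sketch of one planning turn, assuming hypothetical helpers (`llm_sample_actions`, `llm_predict_opponent`, `llm_value`) for the three LLM-backed modules and a `world_model.step` wrapper around the local game engine; none of these names come from the actual codebase.

```python
# A minimal sketch of the per-turn planning loop described above.
# llm_sample_actions, llm_predict_opponent, and llm_value are hypothetical
# stand-ins for the three LLM-backed modules, and world_model.step stands
# in for the local game engine; none of these names are the actual API.

def evaluate(state, world_model, depth):
    """Depth-limited lookahead; leaf states are scored by the LLM value function."""
    if depth == 0 or state.is_terminal():
        return llm_value(state)                    # (3) value function estimation
    return max(
        evaluate(
            world_model.step(state, action, llm_predict_opponent(state)),
            world_model,
            depth - 1,
        )
        for action in llm_sample_actions(state)    # (1) player action sampling
    )

def plan_turn(state, world_model, horizon):
    """Pick the root action whose simulated branch earns the highest value."""
    def branch_value(action):
        opp_action = llm_predict_opponent(state)   # (2) opponent modeling
        next_state = world_model.step(state, action, opp_action)
        return evaluate(next_state, world_model, horizon - 1)
    return max(llm_sample_actions(state), key=branch_value)
```

Because the opponent's action is a single likely prediction rather than a full min step, the search stays narrow enough for an LLM to evaluate every leaf.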
As shown in the tree search above, PokéChamp uses one-step lookahead prompts to gather admissible heuristic information about the effects of actions. The lookahead performs a basic damage calculation (a tool easily accessible and commonly used by players via browser extensions and websites) to estimate the number of turns needed to knock out an opposing Pokémon, using only the current state's information.
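A heuristic of this kind might look like the following sketch. `estimate_damage` keeps only the simplified core of the mainline damage formula (no items, abilities, weather, or damage rolls), and `type_effectiveness` plus the attribute names are illustrative assumptions rather than the agent's actual code.

```python
import math

def estimate_damage(attacker, defender, move):
    # Simplified core of the mainline damage formula; real calculators also
    # apply items, abilities, weather, screens, and the 0.85-1.0 damage roll.
    base = ((2 * attacker.level / 5 + 2) * move.power
            * attacker.attack / defender.defense) / 50 + 2
    stab = 1.5 if move.type in attacker.types else 1.0   # same-type attack bonus
    # type_effectiveness is a hypothetical lookup over the standard type chart.
    return base * stab * type_effectiveness(move.type, defender.types)

def turns_to_ko(attacker, defender, move):
    """Turns needed to knock out the defender, assuming repeated use of `move`."""
    damage = estimate_damage(attacker, defender, move)
    if damage <= 0:
        return math.inf  # e.g. a type immunity: the move can never KO
    return math.ceil(defender.current_hp / damage)
```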
The table above shows player and opponent action prediction accuracy by Elo rating. Random-baseline performance is 7% for player prediction and under 1% for opponent prediction due to partial observability. Conditioning predictions on Elo lets the bot adapt dynamically to its opponent and predict the moves most probable at that skill level.
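As a toy illustration of Elo conditioning, the sketch below samples an opponent move from empirical distributions bucketed by rating; `situation_key` and `move_counts` are hypothetical stand-ins for a state abstraction and statistics mined from the battle dataset, whereas the actual agent conditions an LLM prompt on the opponent's Elo.

```python
import random

# Hypothetical Elo brackets; the real dataset could be bucketed differently.
ELO_BUCKETS = [1000, 1200, 1400, 1600, 1800]

def elo_bucket(elo):
    """Map a raw Elo rating to the nearest bucket at or below it."""
    return max((b for b in ELO_BUCKETS if b <= elo), default=ELO_BUCKETS[0])

def predict_opponent_move(state, opponent_elo, move_counts):
    """Sample a likely opponent move from the empirical distribution at this Elo.

    move_counts maps (situation, elo_bucket) -> {move: count}, where
    situation_key is a hypothetical abstraction of the battle state.
    """
    counts = move_counts[(situation_key(state), elo_bucket(opponent_elo))]
    moves, weights = zip(*counts.items())
    return random.choices(moves, weights=weights, k=1)[0]
```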
In a 1v1 benchmark against a baseline bot (in this case Abyssal Bot, chosen due to its use in the actual games), we observe better performance than prior work.
The table above shows the performance of PokéChamp and our baseline models against the Abyssal baseline bot in the Gen 9 OU battle format, using the Terastallize mechanic and custom teams.
The table above also shows the performance of PokéChamp and prior work using different LLM priors against the Abyssal baseline bot in the Gen 9 OU battle format. PokéChamp achieves better results than prior work even when using a weaker LLM prior.
This table shows the projected Elo of PokéChamp after facing real human players on the online competitive battle simulator Pokémon Showdown. From this benchmark, we observe that PokéChamp achieves an Elo of up to 1500.
The matrix above shows win rates across different model matchups. In the top row, which shows PokéChamp's win rate (using GPT-4o) against other LLM-based and rule-based bots, PokéChamp outperforms its opponent in every matchup.
Demos
Understanding Type Matchups
A poor type matchup between Dragapult and Kingambit, as well as between Iron Crown and Kingambit, leads the agent to swap in order to improve its position. It continues to switch in Pokémon based on the type matchup to maintain a type advantage over the opponent.
Recognizing an overall poor type matchup, the agent exhibits constant switching to ensure a type advantage over the opponent. This example highlights the agent not only switching into Pokémon with an offensive type advantage over the opponent, but also into Pokémon with a defensive type advantage.
The agent, being familiar with Terastallization, further shows its understanding of type matchups and the mechanic by swapping in Zamazenta against Enamorus. This is normally a poor position for the agent: Enamorus, a Fairy/Flying type, has a major type advantage over Zamazenta, a Fighting type. However, the agent uses Terastallization to avoid being one-hit KO'd, and uses that opportunity to one-hit KO the opposing Enamorus.
Each example is a snippet from a battle between the PokéChamp agent and the PokéLLMon agent, and each exhibits PokéChamp strategically positioning itself to maintain an advantage over the opponent through switching and its understanding of type matchups. While type matchups are studied and used in many bots and agents to choose optimal offensive moves, switching Pokémon in and out for defensive purposes is much less common. This idea of switching Pokémon to gain a type advantage both offensively and defensively is widely used in the competitive scene. As these examples show, PokéChamp exhibits competitive-level strategy not only through advantageous switching but also by predicting opponent moves, sometimes choosing moves that are not optimal for the current type matchup but are better in case the opponent switches.
This switching behavior and strategy emerges from PokéChamp's use of opponent modeling to predict what the opponent will likely do next. Predicting the opponent's likely action lets PokéChamp make more optimal decisions and aids its strategic planning. Modeling the opponent also yields a more informed tree search, allowing the agent to better plan for and react to the opponent. As a result, we see an improvement in performance over previous work: the agent works not only with past and present state information, but with likely future states and outcomes as well.
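For concreteness, an opponent-model query of this kind could fold the opponent's revealed team, recent moves, and rating into a prediction prompt, as in the illustrative sketch below; the wording and field names are assumptions, not PokéChamp's actual prompt.

```python
def build_opponent_prompt(opponent, history, elo):
    """Assemble an illustrative opponent-prediction prompt (not the real one)."""
    revealed = ", ".join(p.species for p in opponent.revealed_pokemon)
    recent = "; ".join(f"turn {t}: {move}" for t, move in history[-5:])
    return (
        f"You are modeling a Pokemon player rated about {elo} Elo.\n"
        f"Their revealed team: {revealed}.\n"
        f"Their recent actions: {recent}.\n"
        "Given the current battle state, which move or switch are they "
        "most likely to choose next? Answer with a single action."
    )
```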
Solving Stall Puzzles
This demo shows the agent solving the puzzle of facing an opponent's stall Pokémon: it uses switching to heal its own Pokémon and eventually take down the opponent's. Here the agent continuously swaps between Gliscor and Alomomola, as both have the ability to heal. Swapping allows the bot to take advantage of the opponent's poor move prediction and eventually defeat it.
Even with the opponent's Pokémon playing a stall strategy, repeatedly using Recover and Protect to restore its health, the agent maintains its offense and sneaks in hits between the opponent's moves. This consistent offensive strategy allows Zamazenta to eventually whittle down the opposing Garganacl.
The examples above showcase PokéChamp's ability to solve simple stall puzzles against a single opposing stall Pokémon. Each showcases a different scenario, such as having a Pokémon without a type advantage but with a mechanical advantage, or having a Pokémon with a type advantage. This puzzle-solving ability can likely also be attributed to the agent's ability to model the opponent. The opponent model is trained on data from Pokémon Showdown, which sees a wide variety of strategies, including stall strategies. Stall strategies typically consist of Pokémon with high health and defense, along with moves that slowly whittle away the opponent's health. They also usually carry moves that help them last longer in battle, such as healing or defensive moves like Recover and Protect. Without proper planning, stall strategies are hard to break. While the agent is still prone to this type of strategy, we have observed better performance with opponent modeling and tree planning, which allow the agent to break through these stall strategies.