To go into further detail on the modeling choices and methods:
Team sheet data was gathered directly from LabMaus’ tournament webpages. A bag-of-words approach was used to create a one-hot encoded version of the team sheet dataset, and linearly dependent covariates were then removed so that the number of observations was greater than the number of covariates.
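As a rough sketch of what this encoding step can look like in R (the team-building choices, team sheets, and counts below are made up purely for illustration, not the actual LabMaus data):

```r
# Illustrative sketch only: treat each team sheet as a "bag" of tokens
# (Pokemon, items, abilities, moves); the vocabulary and teams are simulated.
set.seed(1)
vocab <- c("Archaludon", "Electro Shot", "Incineroar", "Fake Out",
           "Flutter Mane", "Booster Energy", "Rillaboom", "Wood Hammer")
teams <- replicate(120, sample(vocab, 5), simplify = FALSE)  # 120 fake team sheets

# One-hot encoding: one row per team, one 0/1 column per team-building choice
X <- t(vapply(teams, function(choices) as.integer(vocab %in% choices),
              integer(length(vocab))))
colnames(X) <- vocab
```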
The linear model I chose was Ridge regression, fit with a penalized iteratively reweighted least squares (IRLS) algorithm. LASSO and Elastic Net can shrink some coefficients to exactly zero, performing feature selection, and this is not what we’d like to do here. In reality, we know all team sheet choices contribute to tournament success, so eliminating some of these choices discards information from the dataset unnecessarily.
Although ordinary least squares (OLS) regression also achieves the goal of descriptive inference, it tends to overfit to covariates with low representation in the data. Ridge also falls prey to this, as can be seen in some of the bar plots on the Shiny App, but it mitigates the problem somewhat compared to OLS. To choose the Ridge penalty parameter, I used 10-fold cross-validation, optimizing for both out-of-sample log loss and out-of-sample residual fit.
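For illustration, here is a glmnet-based sketch of this kind of penalized fit and penalty selection, continuing from the snippet above. glmnet’s binomial solver wraps an IRLS-style step around a penalized weighted least squares fit, and the wins/games outcome is an assumption, so treat this as a stand-in for the actual implementation rather than the exact code used:

```r
library(glmnet)

# Assumed outcome: wins and games played per team (simulated here purely
# so the example runs; the real outcome comes from tournament results).
set.seed(2)
games <- rep(8, nrow(X))
wins  <- rbinom(nrow(X), size = games, prob = 0.5)
y <- cbind(games - wins, wins)   # two-column count response for family = "binomial"

# alpha = 0 gives a pure Ridge penalty (alpha = 1 would be LASSO).
# 10-fold CV over lambda, scored by out-of-sample binomial deviance
# (proportional to log loss); only the log-loss criterion is shown here.
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 0,
                    nfolds = 10, type.measure = "deviance")
cv_fit$lambda.min                # penalty with the best cross-validated log loss
```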
Another important point to consider is the inclusion of covariates with low representation within the dataset. Many team-building choices appear on only one or a few teams, and because these teams often look similar to other teams in the dataset, models tend to overestimate the effect of these specific choices, which results in inference that is biased by small-sample effects. For example, if a team in the dataset is extremely similar to other teams, differing only by a single move choice, and this team performs poorly, the model will estimate that this move choice had a large negative contribution to win percentage, when in reality this doesn't match the intuitive relationship between team-building and tournament performance.
Including these small-sample effects usually produces a butterfly effect: when low-usage team-building choices are assigned large positive or negative contributions, they strongly distort the inference for high-usage choices. This results in plots and tables that are much harder to draw conclusions from: we have to consider the combination of Pokémon+Item+Ability+Moves all together to get a sense of what the model favors or doesn't favor, rather than just looking at the contribution of a single team-building choice in isolation.
To help deal with this issue, I decided to remove covariates with very low representation within the dataset. It's easy to remove covariates that appear on less than some percentage of teams, but some covariates are extremely correlated with others, and this also produces small-sample effects. For example, teams with Archaludon almost always run Electro Shot as well. Including both of these covariates results in Archaludon having a negative contribution to win rate, while Electro Shot has a large positive contribution, because the teams with Archaludon but without Electro Shot usually perform poorly. Intuitively, however, Archaludon is a very useful Pokémon, so its negative contribution to win rate leads to an incorrect conclusion from the model's inference. To further deal with this issue, I computed correlations between team-building choices and eliminated one feature from each highly correlated pair. This leaves almost all team-building choices contributing positively to win percentage, which allows more concrete conclusions to be drawn and results in cleaner tables and plots.
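A minimal sketch of these two filtering steps, continuing from the snippets above (both cutoffs, and the use of caret's findCorrelation(), are illustrative choices rather than the exact ones used):

```r
library(caret)

# 1) Drop choices that appear on fewer than some fraction of teams
#    (the 5% threshold here is illustrative).
usage <- colMeans(X)
X_filtered <- X[, usage >= 0.05, drop = FALSE]

# 2) Drop one feature from each highly correlated pair
#    (the 0.9 cutoff is illustrative). findCorrelation() returns the
#    column indices recommended for removal.
high_cor <- findCorrelation(cor(X_filtered), cutoff = 0.9)
if (length(high_cor) > 0) X_filtered <- X_filtered[, -high_cor, drop = FALSE]
```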
With the chosen penalty parameter, I fit the penalized IRLS Ridge model to the entire tournament dataset to perform descriptive inference. Everything displayed on the Shiny App is descriptive inference performed on the entire dataset and should not be interpreted as predictive in any way.
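Continuing the glmnet-based sketch, the final descriptive fit and the per-choice contributions would look roughly like this:

```r
# Refit on the full (filtered) dataset at the chosen penalty and extract
# the per-choice contributions, the kind of quantity shown in the plots/tables.
final_fit <- glmnet(X_filtered, y, family = "binomial", alpha = 0,
                    lambda = cv_fit$lambda.min)
contributions <- coef(final_fit)   # intercept plus one coefficient per choice

# Choices with the largest positive estimated contributions
head(sort(contributions[-1, 1], decreasing = TRUE))
```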