Supporting Materials

Simulation-based Card Evaluation

Here are some example cards from the card pool of one of the independent runs, with their evaluations in both the meta environment and the random environment (using the same pool of 1000 random cards). Note: Chaos Cards currently only has a text UI. Thanks to Zachary Chavis for creating all the card art on this page for better showcasing. The card names were also invented along with the card art.

  • Top three and bottom three cards in meta environment:

Meta: Eval 0.617, Rank #1

Random: Eval 0.958, Rank #1

Meta: Eval 0.573, Rank #2

Random: Eval 0.539, Rank #202

Meta: Eval 0.553, Rank #3

Random: Eval 0.526, Rank #241

Meta: Eval 0.072, Rank #1000

Random: Eval 0.392, Rank #867

Meta: Eval 0.093, Rank #999

Random: Eval 0.394, Rank #854

Meta: Eval 0.094, Rank #998

Random: Eval 0.468, Rank #470

  • Top three and bottom three cards in random environment:

Random: Eval 0.958, Rank #1

Meta: Eval 0.617, Rank #1

Random: Eval 0.868, Rank #2

Meta: Eval 0.540, Rank #6

Meta: Eval 0.834, Rank #3

Random: Eval 0.492, Rank #23

Random: Eval 0.276, Rank #1000

Meta: Eval 0.234, Rank #878

Random: Eval 0.282, Rank #999

Meta: Eval 0.310, Rank #700

Random: Eval 0.293, Rank #998

Meta: Eval 0.161, Rank #956

(The token card from D2)

Some notes/observations:

      • The range of the strength evaluation (win rate weighted by participation) is drastically different between the meta and random environments. The meta environment has overall lower values than the random environment. This is expected because the meta environment evolves toward competitive decks, so the statistics come primarily from matches against competitive decks (an evaluation above 0.5 in the meta environment indicates a very strong card). Therefore, directly comparing numbers across different environments is not very meaningful.

      • We believe the meta environment gives more reasonable evaluations for strong cards, especially cards with utility-oriented effects such as card drawing, compared to the random environment. The examples highlighted in cyan are such cases, where the random environment vastly underestimates the strength of cards S1 and S4. A simple explanation for the case of card drawing is that the meta environment consists mostly of decks of strong cards, so card drawing replenishes one's hand with cards that are likely to be useful for winning the game, unlike the random environment, where card drawing is not valued as much.

      • We believe the meta environment gives less reliable evaluations for weak cards, compared to the random environment. This is due to the way evolution works in the meta-based evaluation: we want to test stronger cards more, which naturally results in fewer tests of weak cards, even with an exploration-exploitation balance. This decision was made because we believe that, given the choice, a human player would not select obviously weak cards when building decks, so the accuracy of evaluating those cards is less important.
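The two mechanisms discussed in the notes above, the participation-weighted win rate and the exploration-exploitation balance in choosing which cards to test, can be sketched roughly as follows. This is a minimal illustration with hypothetical per-card bookkeeping (`records`, `evals`, `counts` are invented names), not the exact scheme used in the system:

```python
import random

def weighted_eval(records):
    """Participation-weighted win rate for one card.
    `records` is a list of (games_participated, games_won) pairs,
    one per deck the card appeared in (hypothetical bookkeeping)."""
    total_games = sum(games for games, _ in records)
    if total_games == 0:
        return 0.0
    return sum(wins for _, wins in records) / total_games

def pick_card_to_test(evals, counts, epsilon=0.1, rng=random):
    """Epsilon-greedy sketch of an exploration-exploitation balance:
    mostly test the currently strongest card, occasionally the
    least-tested one."""
    if rng.random() < epsilon:
        return min(counts, key=counts.get)  # explore under-tested cards
    return max(evals, key=evals.get)        # exploit strong cards

# A card seen in two decks: 80 wins over 150 games in total.
strength = weighted_eval([(100, 60), (50, 20)])

# Deterministic example with a stub standing in for the random module:
class _AlwaysExploit:
    @staticmethod
    def random():
        return 0.9  # above epsilon, so the strongest card is picked

picked = pick_card_to_test({"s1": 0.62, "w9": 0.07}, {"s1": 40, "w9": 5},
                           rng=_AlwaysExploit)
```

Under this sketch, weak cards like "w9" are only tested on the exploration branch, which is why their evaluations accumulate fewer samples.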

For the complete list of cards for this independent run, please refer to [Here (with meta evaluation)] and [Here (random evaluation)]. The evaluation of a card can be found as the last number following its card description (larger is stronger). Note: the card data files attached here use a slightly older version of the card description, where the term "ability" was referred to as "attribute".

Validation of Evolution

In order to validate that the evolution method actually evolved the environment into a relatively competitive one (which we call the meta environment), we performed pairwise matches between the final active deck pool in the meta environment and the top 30 decks in the random environment. The results show an overwhelming win rate (>95%) for the meta decks (see the figure below; this is averaged over all independent runs). The main point here is that, when we evaluate cards in the meta environment, we are actually evaluating them in a meaningful context, rather than just matching random decks against each other, which is not what a real card environment would be like.
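The validation procedure amounts to a full pairwise round-robin between the two deck pools. A minimal sketch, assuming a `play_match(a, b)` function that reports whether deck `a` beats deck `b` (in the actual system the matches are of course played in the simulator):

```python
from itertools import product

def pairwise_win_rate(meta_decks, random_decks, play_match):
    """Fraction of pairwise matches that the meta decks win against
    the top decks from the random environment."""
    wins = total = 0
    for a, b in product(meta_decks, random_decks):
        wins += bool(play_match(a, b))
        total += 1
    return wins / total

# Toy example: decks reduced to strength numbers, stronger deck wins.
rate = pairwise_win_rate([0.9, 0.8], [0.3, 0.2, 0.1],
                         play_match=lambda a, b: a > b)
```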

Card Strength Prediction

Here are some example cards from the test set (1000 newly generated random cards), where we generate new cards and apply the neural network model trained on the evaluations from one of the independent runs (the same run as the one used for the evaluation examples above).

  • Predicted strongest/weakest leader/minion/spell:

Strongest Leader

Predicted Eval 0.528, Rank #1

Strongest Minion

Predicted Eval 0.517, Rank #6

Strongest Spell

Predicted Eval 0.509, Rank #14

Weakest Leader

Predicted Eval 0.446, Rank #801

Weakest Minion

Predicted Eval 0.402, Rank #1000

Weakest Spell

Predicted Eval 0.446, Rank #802

The observation here is that the model is able to tell drastically strong and weak cards apart (to a large extent based on their stats), but it is very conservative about the numerical range of its predictions.
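One way to quantify this conservativeness is to compare the spread of the predicted evaluations with the spread of the simulated ones. A rough illustration using the extreme values reported on this page (the predicted extremes from this test set versus the meta-environment extremes from the training run, which are not exactly the targets of this test set, so the ratio is only indicative):

```python
def range_ratio(preds, targets):
    """Ratio of the predicted value range to a reference value range;
    values well below 1.0 indicate conservative predictions."""
    return (max(preds) - min(preds)) / (max(targets) - min(targets))

# Predicted extremes 0.528 and 0.402, versus meta evaluations spanning
# roughly 0.072 to 0.617 in the training run.
ratio = range_ratio([0.528, 0.402], [0.617, 0.072])
```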

  • Selected example of two zero mana spells:

Predicted Eval , Rank #34

Predicted Eval , Rank #277

This example comparison shows that the trained predictor is able to capture relative strength for some other commonly occurring patterns, such as dealing damage to enemies versus allies (rarer effects are predicted with less distinction due to insufficient data).

For the complete list of cards for this test set, please refer to [Here]. Note: the card data files attached here use a slightly older version of card description, where the term "ability" was referred to as "attribute".

Neural Network Model Comparison

We ran some tests using different neural network models for predicting card strengths. The models we compare our recursive neural network (RvNN) model against are sequence models: the recurrent neural network (RNN), the gated recurrent unit (GRU), and the long short-term memory (LSTM) models. For the sequence models, we flattened our hierarchical card representations with pre-order tree walks. For a fair comparison, we keep the number of parameters in all these models close to each other (all in the range of 4.1k ~ 4.3k). In addition, they use the same activation function wherever appropriate (as the sine activation function seems to speed up training greatly, we used it in all the models in the places where relu or tanh are conventionally used; the sigmoid activations in some sequence models were not preserved). The following is the progression of the validation loss (log-cosh of the difference, a blend of MSE and MAE) over run time for each of the models. The run times were measured on a machine with 2.3 GHz Intel Xeon CPUs (limited to 4 cores in this part) and 64 GB of RAM.
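The preprocessing and loss mentioned above can be sketched as follows, assuming cards are represented as `(label, children)` nested tuples, which is an illustrative stand-in for the actual card trees:

```python
import math

def preorder_flatten(node):
    """Flatten a hierarchical card representation into a token sequence
    via a pre-order tree walk, as fed to the sequence-model baselines."""
    label, children = node
    seq = [label]
    for child in children:
        seq.extend(preorder_flatten(child))
    return seq

def sine_act(x):
    """Sine activation, used where relu/tanh would conventionally appear."""
    return math.sin(x)

def log_cosh_loss(pred, target):
    """Validation loss: log-cosh of the difference, which behaves like
    MSE for small errors and like MAE for large ones."""
    return math.log(math.cosh(pred - target))

# Hypothetical card tree: a minion with a stats node and one ability.
tokens = preorder_flatten(
    ("minion", [("stats", []), ("ability", [("deal_damage", [])])]))
```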

The figure above shows that our RvNN model trains faster and reaches a lower loss than the other models. We then take a point in the training history of each model such that the training time is similar across models, and investigate other statistics besides the loss, as listed in the table below.

From the table above, our model still performs reasonably well among the models. Here are some additional notes:

    • Underfitting/overfitting is somewhat hard to judge, and our decisions about how long to train are probably not very robust.

    • The unweighted results are pretty underwhelming, especially considering that the predictions already tend to have a small range.

    • The weighted results look good for the RvNN and the RNN in terms of the weighted correlation coefficient, but are still not ideal.

    • The RNN does not look bad among the models and seems better than the GRU and LSTM. One relevant factor is that, in order to have a similar number of parameters, the RNN has a much larger hidden layer than the GRU and LSTM (also much larger than any internal layer of the RvNN), which might grant it some advantage in preserving features.

    • At least it is clear that our RvNN model learns faster than the other models, even slightly faster than the RNN with its very large hidden layer.
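For reference, the weighted correlation coefficient mentioned in the notes above can be computed as a weighted Pearson correlation. This is the standard formula; the exact weighting used for the table is not restated here, so the weights in the example are illustrative:

```python
def weighted_corr(xs, ys, ws):
    """Weighted Pearson correlation between predictions and targets."""
    sw = sum(ws)
    mx = sum(w * x for x, w in zip(xs, ws)) / sw
    my = sum(w * y for y, w in zip(ys, ws)) / sw
    cov = sum(w * (x - mx) * (y - my) for x, y, w in zip(xs, ys, ws)) / sw
    vx = sum(w * (x - mx) ** 2 for x, w in zip(xs, ws)) / sw
    vy = sum(w * (y - my) ** 2 for y, w in zip(ys, ws)) / sw
    return cov / (vx * vy) ** 0.5

# Perfectly linear data has correlation 1.0 under any positive weights.
c = weighted_corr([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 2.0, 1.0])
```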