Customizing Scripted Bots: Sample Efficient Imitation Learning for Human-like Behavior in Minecraft

Human game-play in the Minecraft Gladiator Arena

Overview

Despite the many advances in machine learning, video games still largely define character behavior through scripted rules or behavior trees. This results in predictable behavior that is discernibly different from human gameplay. We demonstrate an approach that combines imitation learning with scripted agents to train hierarchical policies sample-efficiently.

Integrating both programmed and learned components, these hierarchical policies leverage the strengths of both approaches. The learned controller improves the expressiveness of the original scripted agent, allowing more diverse and human-like behavior to emerge. The remaining scripted elements provide interpretable, guaranteed behavior, allowing a developer to be confident in the behavior of the learned policy. We demonstrate this interplay between classical AI techniques and statistical machine learning through a case study in Minecraft.

Environment

Our experiments were conducted in the Minecraft Gladiator Arena scenario using Malmo. This scenario pits the player against waves of enemies in an enclosed room (map shown to the right). Items, such as weapons, armor, and health potions, are spawned on the east and west portions of the map.

A scripted agent has been developed specifically for this scenario; it uses hand-coded path-finding, inventory management, attack logic, and a fixed priority list to determine its next action. This scripted agent is reasonably effective at defeating enemies and collecting items, though it behaves noticeably differently from humans. Humans use a variety of strategies and adapt them based on contextual information.
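
For illustration, the snippet below sketches what such a fixed-priority target selector might look like. The specific priorities, state fields, and subroutine names are assumptions made for exposition, not the bot's actual implementation.

```python
# Sketch of a fixed-priority target selector like the one the scripted bot
# uses to pick its next action. Priorities, state fields, and names are
# illustrative assumptions, not the actual implementation.

def scripted_select_target(state):
    if state["low_health"] and state["has_health_potion"]:
        return ("use_item", "health_potion")            # survive first
    if state["visible_items"]:
        return ("collect", state["visible_items"][0])   # nearest spawned item
    if state["visible_enemies"]:
        return ("attack", state["visible_enemies"][0])  # nearest enemy
    return ("move_to", "open_area")                     # default safe location
```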

Given 33 human demonstrations, we would like to leverage the existing scripted agent to create an agent which demonstrates more diverse, human-like behavior in this scenario.

Overview of Approach

- Extract high-level strategy from player data using a labeling function

- In our case, the strategy is determined by the sequence of targets (or subgoals) the agent selects and pursues

- Create a hierarchical policy: a meta-controller sets the high-level strategy and invokes scripted subroutines as low-level policies (a minimal sketch follows this list)

- Train the meta-controller using imitation learning for human-like strategy
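
To make the pipeline concrete, the sketch below shows (1) a simple labeling function that imputes the player's current target from a demonstration and (2) how a learned meta-controller can be wired around the existing scripted subroutines. All names and data formats here are illustrative assumptions rather than the exact interfaces used in our implementation.

```python
# Minimal sketch of the approach. Function, class, and field names are
# illustrative placeholders.

def label_targets(states, interactions):
    """Labeling-function sketch: impute the target at each step as the entity
    involved in the player's *next* interaction (item pickup or enemy kill).
    `interactions` is a list of (step_index, entity_id) events."""
    labels, j = [], 0
    for t in range(len(states)):
        while j < len(interactions) and interactions[j][0] < t:
            j += 1
        labels.append(interactions[j][1] if j < len(interactions) else None)
    return labels

class HybridAgent:
    """Hierarchical policy: a learned meta-controller picks the next target
    (subgoal) and the original scripted subroutines pursue it."""
    def __init__(self, meta_controller, scripted_bot):
        self.meta_controller = meta_controller   # learned SelectTarget policy
        self.scripted_bot = scripted_bot         # hand-coded subroutines

    def act(self, observation):
        target = self.meta_controller.select_target(observation)
        # Path-finding, inventory management, and attack logic remain scripted.
        return self.scripted_bot.pursue(target, observation)
```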


For more information, check out the paper (link coming soon!).

Gameplay Comparison

Human demonstrations contained many different behaviors including various high-level strategies as well as contextual decisions such as retreating.

The scripted agent exhibited predictable, guaranteed behavior which was relatively inflexible with respect to context or human-like strategies.

Standard imitation learning approaches, such as behavior cloning, result in an agent that usually dies without attacking or avoiding the enemies at all, as there are too few demonstrations.

The hybrid agent demonstrated human-like high-level strategies and contextual decision making while preserving the behavior guarantees afforded by the scripted components.

Evaluation

Does the Controller Generalize Across Human Trajectories?

Before evaluating the behavior of the resulting hybrid agent, we first evaluate the predictive accuracy of the model on the test set. Evaluation is performed by executing the policy network both on fixed-length sequences and on all previous observations. When using all previous observations, we expect the controller to struggle, since it must retain the salient information from long state sequences in its LSTM units. As baselines, we report the performance of random guessing and of the scripted agent’s target-selection logic at predicting the current human target, as imputed by the labeling function, on the test set. Results are shown in the figure below. As expected, the learned controller achieves higher accuracy than both baselines and more closely matches human target selection than the original scripted agent.

This suggests that the learned controller is indeed able to identify the target a player is likely to go after given the recent gameplay context, when evaluated in the kinds of situations a player might face.
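
As a concrete illustration of the model under evaluation, the sketch below shows a small LSTM-based target-selection network and the accuracy computation over fixed-length windows versus full observation histories. The architecture, feature sizes, and names are assumptions made for exposition, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class SelectTargetNet(nn.Module):
    """Sketch: an LSTM meta-controller mapping a sequence of game-state
    features to a distribution over candidate targets. Sizes are placeholders."""
    def __init__(self, obs_dim=64, hidden_dim=128, num_targets=10):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_targets)

    def forward(self, obs_seq):
        out, _ = self.lstm(obs_seq)    # obs_seq: (batch, time, obs_dim)
        return self.head(out[:, -1])   # logits for the current target

def accuracy(model, sequences, labels, window=None):
    """Predictive accuracy on fixed-length windows (window=k) or on all
    previous observations (window=None)."""
    correct = 0
    with torch.no_grad():
        for seq, y in zip(sequences, labels):
            x = seq if window is None else seq[-window:]
            logits = model(x.unsqueeze(0))
            correct += int(logits.argmax(dim=-1).item() == y)
    return correct / len(labels)
```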

Predictive Accuracy Generalization Across Human Trajectories

Does the Agent Exhibit Diverse, Human-like Strategies?

In this section, we evaluate the distribution of strategies used by the human players, the behavior-cloned (BC) agent, the original scripted agent, and the hybrid agent. In this context, a strategy is the order in which the player collects (or discards) items and kills enemies. To simplify the exposition, we only report each player's strategy during the first round.

In the first round of the Arena, two items and three enemies are spawned. The items are a “Stone Sword” and an “Iron Chestplate”, and all three enemies are zombies, which have a melee attack but no ranged attack. Strategies are determined by the order of player events: picking up items, discarding items, and killing enemies. The following abbreviations are used: K (kill enemy), P (pick up item), and D (discard item); picked-up or discarded items are marked S (sword) or C (chestplate). The ATTACK_ONLY strategy is when the player attacks the enemies without any item and dies before killing a single enemy.
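
A strategy label of this form can be derived mechanically from the event log of a round; the sketch below shows one way to do so, assuming a simple list of (event, item) tuples as input.

```python
# Sketch: convert a first-round event log into a strategy string such as
# "PS_K_K_K". The (event, item) log format is an assumption for illustration.

ITEM_CODES = {"stone_sword": "S", "iron_chestplate": "C"}

def strategy_string(events):
    tokens = []
    for event, item in events:
        if event == "kill":
            tokens.append("K")
        elif event == "pickup":
            tokens.append("P" + ITEM_CODES[item])
        elif event == "discard":
            tokens.append("D" + ITEM_CODES[item])
    return "_".join(tokens) if tokens else "ATTACK_ONLY"

# Example: pick up the sword, then kill the three zombies.
events = [("pickup", "stone_sword"), ("kill", None), ("kill", None), ("kill", None)]
assert strategy_string(events) == "PS_K_K_K"
```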

One common strategy among the human trajectories was to first collect the sword and then kill all the enemies. Conversely, the original scripted agent always prioritizes collecting the items, then returns to an open portion of the map and fights the zombies. Although this is the default behavior of the scripted agent, stochasticity in the environment and its ability to fight any enemy within range sometimes result in enemies being killed while it is collecting items. The BC agent often dies without attacking and neither kills an enemy nor collects any items.

Humans showed a variety of different strategies. The most frequent was picking up the sword and then killing the three zombies (PS_K_K_K), followed by collecting both items and then killing the enemies (PS_PC_K_K_K), and simply attacking without collecting any items (ATTACK_ONLY). The latter strategy was more common in the novice demonstrations.

The original scripted agent always collects the sword, then the chestplate, and then returns to an open area of the map to fight the enemies. However, due to stochasticity in the environment and the agent's auto-attacking of any nearby enemies, it may sometimes kill enemies that it runs into while collecting the items (such as PS_K_PC_K_K).

As shown in the earlier video, the agent trained using behavior cloning performs very poorly and dies without even attacking the enemies in 92% of collected trajectories. In the remaining trajectories, the agent lands a blow but still loses before killing any enemies or collecting any items.

The hybrid agent shows much more diversity in strategies and exhibits some of the strategies commonly seen in the human demonstrations. The two most common strategies (PS_K_K_K and ATTACK_ONLY) are also two of the top strategies seen in the human demonstrations.

Can the Agent Produce Guaranteed Behavior?

The hybrid agent is evaluated with respect to preserving two behavior guarantees provided by the scripted sub-policies: using health-restoring items and discarding useless items. As the training demonstrations were collected from non-experts, some of them contain examples that violate the desired behavior guarantees. Consequently, standard imitation learning approaches cannot hope to consistently produce compliant behavior, much less provide guarantees about the agent’s behavior. To evaluate these behavior guarantees, we defined two corresponding constraints and checked for violations on the trajectories of the hybrid agent and the human demonstrations.

As we would like to ensure that the agent uses health-restoring items when needed, the first constraint requires that the agent not have any unused health-recovery items in its inventory when it dies. Five of the 33 human demonstrations (15.15%) violated this constraint. As expected, none of the trajectories collected from the scripted agent or the hybrid agent violated it.

As items are automatically picked up when the agent touches them, the second constraint was defined more flexibly: it simply checks whether the agent discards useless items within 1 minute. This constraint was violated by 23 of the 33 human demonstrations (69.70%). Furthermore, the majority (68.86%) of the states in the human demonstrations had inventories containing useless items. As with the previous constraint, neither the scripted agent nor the hybrid agent violated this constraint.
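
Both constraints are simple to check mechanically over logged trajectories. The sketch below gives one possible formulation; the trajectory and field names (`died`, `inventory`, `pickup_time`, and so on) are assumptions about the log format rather than our exact tooling.

```python
# Sketch of the two behavior-guarantee checks. Field names are assumptions.

def violates_health_constraint(trajectory):
    """True if the agent died while still holding an unused health-restoring item."""
    final = trajectory[-1]
    return final["died"] and final["inventory"].get("health_potion", 0) > 0

def violates_discard_constraint(pickups, max_delay=60.0):
    """True if a useless item stayed in the inventory for more than `max_delay`
    seconds after being (automatically) picked up. `pickups` is a list of dicts
    with 'useless', 'pickup_time', and 'discard_time' fields."""
    for p in pickups:
        if not p["useless"]:
            continue
        if p["discard_time"] is None or p["discard_time"] - p["pickup_time"] > max_delay:
            return True
    return False
```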

The agent trained with behavior cloning did not live long enough to collect any health or useless items and consequently could not violate either constraint. However, improvements to its performance are likely to result in violations, given the prevalence of these behaviors in the demonstrations.

Can the Learned Controller Affect Other Scripted Subroutines?

The behavior of the agents and humans was also evaluated with respect to map occupancy across all trajectories, visualized as heatmaps in Figure 6. This represents another dimension of playstyle in which the scripted agent is distinct from the human players. Although this form of evaluation does not capture the specific context or subgoal of the player, it provides insight into movement behavior aggregated across trajectories.
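
Heatmaps of this kind can be produced by binning the agent's horizontal positions over all trajectories; a minimal sketch (with assumed arena bounds and bin count) is shown below.

```python
import numpy as np

def occupancy_heatmap(trajectories, bounds=(-20, 20, -20, 20), bins=40):
    """Sketch: aggregate (x, z) positions from all trajectories into a 2-D
    histogram. Arena bounds and bin count are illustrative assumptions."""
    xs = np.concatenate([t[:, 0] for t in trajectories])
    zs = np.concatenate([t[:, 1] for t in trajectories])
    heat, _, _ = np.histogram2d(
        xs, zs, bins=bins,
        range=[[bounds[0], bounds[1]], [bounds[2], bounds[3]]])
    return heat / heat.sum()   # normalize so agents with different numbers of trajectories are comparable
```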

All the heatmaps show a moderate degree of occupancy for points on the left and right of the map. These points correspond to the item spawn locations; their high visitation suggests that both the humans and the agents have a tendency to collect these items. The human demonstrations (shown in the left heatmap) show a somewhat uniform occupancy for the locations between the items with a slightly higher occupancy in the center where the player is spawned.

The heatmap for the scripted agent (shown in the center heatmap) shows a high frequency of occupancy for the left, right, top and bottom points of the map. The high occupancy locations in the top and bottom of the map are locations known to the scripted bot and used as a default location when there are no other targets to pursue. This ensures that the agent will not move to a location where it may be cornered by enemies. The degree to which these four locations are occupied by the scripted agent illustrates the consistency of its behavior across trajectories which is distinct from the behavior shown in the human demonstrations. In contrast, the hybrid agent’s occupancy map does not prioritize the top and bottom locations and visually mimics the occupancy of human demonstrations.

Human Demonstrations

The human heatmap shows an increased amount of time spent in the left and right portions of the map. These areas are both item spawn locations and provide cover from enemies. The remaining areas show fairly uniform coverage, with the exception of the pillar locations.

Scripted Agent

The scripted agent shows significant amounts of time spent in the left and right portions of the map (item spawn locations) as well as in the top and bottom portions of the map. The latter locations correspond to open areas where the scripted agent will stand and fight opponents.

BC Agent

The BC agent shows an increased amount of time at the locations where it tends to move and then get cornered and die. These include the top wall and a pillar just ahead (and a little to the left) of the start location.

Hybrid Agent

The hybrid agent's heatmap shows similar occupancy to the human demonstrations with focal points near the item spawn points and the initial location.

Is the Agent's Overall Performance Affected?

The overall performance of an agent in a trajectory is measured by the level the player was able to reach. The performance of the humans, the scripted agent, and the hybrid agent across the collected trajectories is shown in Figure 7. The distribution of human performance shows a distinct split in ability: some demonstrations completed all 7 levels, whereas many others were defeated on levels 2-4. The scripted agent showed a similar trend around levels 2 and 4 but had significantly fewer trajectories reaching level 7 (12%). The hybrid agent showed some performance degradation, as it was defeated more often during level 2 and seldom reached the 4th level or beyond.

The second level introduces multiple ranged units, which often defeat the players. Successful human demonstrations often employ behaviors such as strafing, dodging, and using ranged weapons themselves, for which the bot has no corresponding subroutines. As a result, a more human-like SelectTarget controller can hinder the bot’s performance when human-like subroutines are necessary for the chosen target to be achievable.

Human demonstrations contained distinct skill levels. Most human players did not make it to the 5th level, with many losing when ranged units are introduced in level 2. The humans who made it past the 4th level often continued through to the final level.

The scripted agent frequently died on the 2nd level with the introduction of ranged units. Death on the 4th level was a close second, and it reached the final level 12% of the time.

The BC agent never made it past the first level.

The hybrid agent showed some performance degradation, as it was defeated more often in the 2nd level and seldom progressed past the 4th level.