Abstract:
We consider the problem of building a reinforcement learning (RL) agent that can both accomplish non-trivial tasks, like winning a real-time strategy game, and strictly follow high-level language commands from humans, like "attack", even if the command is sub-optimal. We call this novel yet important problem Grounded Reinforcement Learning (GRL). Compared with other language grounding tasks, GRL is particularly non-trivial and cannot be solved by pure RL or behavior cloning (BC) alone. From the RL perspective, it is extremely challenging to derive a precise reward function for human preferences, since the commands are abstract and the valid behaviors are highly complicated and multi-modal. From the BC perspective, it is impossible to obtain perfect demonstrations, since human strategies in complex games are typically sub-optimal. We tackle GRL via a simple, practical, and tractable constrained RL objective and develop an iterative RL algorithm, REinforced demonstration Distillation (RED), to obtain a strong policy. We evaluate the policies derived by RED, BC, and pure RL methods on a simplified real-time strategy game, MiniRTS. Experimental results and human studies show that the RED policy consistently follows human commands and achieves a substantially higher win rate than the baselines, even under random commands.
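To give a rough sense of what such a constrained objective can look like, the sketch below shows one generic formulation (illustrative only; not necessarily the exact objective used by RED): maximize the expected win reward while constraining the policy to stay close to command-consistent behavior, e.g., as estimated from the human demonstrations.

```latex
% Generic constrained-RL sketch (illustrative; not the paper's exact objective).
% \pi is the command-conditioned policy, c a language command, \tau a trajectory,
% R_win the win/lose reward, and D a divergence to a command-following reference
% policy \pi_BC (e.g., fit to the human demonstrations by behavior cloning).
\max_{\pi} \;\; \mathbb{E}_{c,\;\tau \sim \pi(\cdot \mid c)}\big[R_{\mathrm{win}}(\tau)\big]
\quad \text{s.t.} \quad
\mathbb{E}_{c,\,s}\big[D\big(\pi(\cdot \mid s, c)\,\|\,\pi_{\mathrm{BC}}(\cdot \mid s, c)\big)\big] \le \epsilon
```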
The grounded reinforcement learning (GRL) problem in the MiniRTS environment.
(A) MiniRTS is a real-time strategy game where the player in blue needs to control its units to kill the enemy units in red.
(B) A conventional RL agent.
(C) A dataset of human demonstrations in the form of paired abstract language commands (e.g., "attack") and control action sequences. Human actions are often sub-optimal.
(D) GRL aims to learn a command-conditioned agent that plays a stronger winning strategy than the human executor.
(E) A GRL agent should strictly follow the human command even if it is sub-optimal.
The mission of GRL is to learn a policy that first strictly follows the commands, then tries its best to win.
In the conventional RL setting, a well-trained agent focuses on how to win the game and ignores all the commands (as shown in Fig. B). But in the GRL setting, a well-trained agent should strictly follow the command even if it is sub-optimal. For example, as shown in Fig. E, only a few enemy units remain, so the optimal command would be "attack"; however, the commander asks the agent to "retreat", and an obedient agent should still faithfully control the units to stop attacking and retreat.
Moreover, the actions taken by the human executor are also sub-optimal: as shown in Fig. C, the human attacks a unit with full HP while a clearly better strategy is to attack another, dying unit. A well-trained GRL policy should attack the dying one when following the "attack" command (Fig. D).
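To make the setting concrete, the sketch below shows what a command-conditioned policy interface could look like in Python. The names and types are hypothetical simplifications, not the actual MiniRTS API; the point is that the policy consumes both the observation and the current language command, and every action it outputs must be consistent with that command.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical, simplified interface for a command-conditioned (GRL) policy.
# Names and fields are illustrative and do not mirror the MiniRTS codebase.

@dataclass
class Observation:
    units: Dict[int, dict]      # unit_id -> features (type, HP, position, ...)
    resources: int              # resources currently available to the player
    visible_enemies: List[int]  # ids of enemy units inside the field of view

class CommandConditionedPolicy:
    def act(self, obs: Observation, command: str) -> Dict[int, str]:
        """Return one action per controlled unit.

        GRL contract: every returned action must be consistent with `command`
        (e.g., no unit keeps attacking under "retreat"), and among all
        command-consistent behaviors the policy should choose the one most
        likely to win the game.
        """
        raise NotImplementedError

# A conventional RL policy would expose act(obs) only and could ignore the
# command entirely, which is exactly the behavior GRL rules out.
```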
MiniRTS is a grid-world RL environment that distills the key features of complex real-time strategy games. It has two parties: a player (blue) controlled by a human or a policy, and a built-in scripted AI (red). The player controls units to collect resources, construct buildings, and win the game by killing all the enemy units or destroying the enemy base.
Game units: There are 3 kinds of units in MiniRTS: resource units, building units, and army units.
Resource Units: Resource units are stationary and neutral. They cannot be constructed by either party and are created at the beginning of a game. Only "peasants" (an army unit type) of either team can mine resources from them. The mined resources are required to build new building units or army units.
Building Units: (town hall, barrack, blacksmith, stable, workshop, guard tower)
MiniRTS supports 6 different building unit types. 5 of them can produce particular army units by consuming resources (right figure). The "guard tower" cannot produce army units but can attack enemies. None of the building units can move. Building units can be constructed by "peasants" at any available map location.
Army Units: (peasant, spearman, swordman, cavalry, archer, dragon, catapult)
All 7 army unit types can move and attack enemies. Specifically, a "peasant" can mine resources from resource units and construct building units with the mined resources, but its attack power is low. The other 6 army unit types and the "guard tower" are designed with a rock-paper-scissors dynamic. As shown in the left figure, each type is effective against some unit types and vulnerable to others.
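As a rough illustration of this rock-paper-scissors structure, the snippet below encodes the unit categories from the text and a counter table. The specific matchups listed in COUNTERS are placeholders for illustration only; the game's actual relations are the ones shown in the figure.

```python
# Illustrative encoding of MiniRTS unit categories and a rock-paper-scissors
# counter table. The category lists follow the text above; the matchups in
# COUNTERS are placeholders, not the game's exact table (see the figure).

BUILDING_UNITS = ["town hall", "barrack", "blacksmith", "stable", "workshop", "guard tower"]
ARMY_UNITS = ["peasant", "spearman", "swordman", "cavalry", "archer", "dragon", "catapult"]

# unit type -> unit types it is (here, hypothetically) effective against
COUNTERS = {
    "spearman": ["cavalry"],
    "cavalry": ["swordman"],
    "swordman": ["spearman"],
    "archer": ["dragon"],
    # ... remaining matchups omitted; see the figure for the real table
}

def is_effective_against(attacker: str, target: str) -> bool:
    """Check whether `attacker` counters `target` under this illustrative table."""
    return target in COUNTERS.get(attacker, [])
```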
We present a screenshot of the gameplay interface in the image below. The user can select a command from a set of recommendations using the "Select Command" button or type an arbitrary (English) command in the text box below it. If the user presses ENTER while the textbox is empty, the previous command is issued again; we added this shortcut because it is natural to keep sending the same command for several consecutive time steps. To send an empty command, the user can press the "Send Empty Command" button. Since some users may not be good at playing real-time strategy games, they can optionally turn off the fog of war using the "Fog of War: ON/OFF" toggle, which lowers the difficulty of the game.
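The sketch below summarizes this input logic in Python. It is a hypothetical illustration of the described behavior, not the actual interface code.

```python
# Hypothetical sketch of the command-input behavior described above;
# this is an illustration, not the real interface implementation.

class CommandBox:
    def __init__(self):
        self.previous_command = ""
        self.fog_of_war = True

    def on_enter(self, text: str) -> str:
        """ENTER with an empty textbox re-issues the previous command."""
        command = text.strip() or self.previous_command
        self.previous_command = command
        return command

    def on_send_empty(self) -> str:
        """The "Send Empty Command" button explicitly issues an empty command."""
        return ""

    def toggle_fog_of_war(self) -> bool:
        """Turning the fog of war off reveals the map and lowers the difficulty."""
        self.fog_of_war = not self.fog_of_war
        return self.fog_of_war
```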
Here we present an example in which a human commander instructs the RED policy, through a handful of commands, to win a game using dragons. We call this strategy "Tower Defense and Dragon Rush". It is difficult for pure RL to learn such a complicated strategy, since building other army units (e.g., "spearman", "swordman", or "cavalry") is a more direct route to a fast win.
Some human participants prefer to build dragons because dragons sound powerful and cool. But building dragons makes it harder to win, because dragons are expensive: a player who builds dragons at the beginning of a game will have only a few of them ready when the enemies start to attack, which typically leads to a loss. RL-trained policies therefore prefer to build "spearman", "swordman", or "cavalry" units, since these achieve higher win rates.
A well-trained RED policy can follow human commands to execute this complicated strategy. In addition, although the human commands are vague, the RED policy takes suitable actions to follow them. For example, it builds towers close to each other to strengthen the defense, and after finding the enemies, it automatically sends the dragons to attack them.
There are 3 phases in this strategy. Phase 1: Build multiple guard towers to ensure safety. Phase 2: Build a workshop and then build dragons. Phase 3: Send the dragons to find the enemies and attack.
Phase 1: Build Towers
In the early stage of the game, to ensure safety, the human player gives the command "build a guard tower" at every time step. The RED policy assigns one peasant to build the tower while the remaining peasants keep mining resources from the resource unit. Since the command is provided at every time step, the peasant keeps constructing towers one after another.
Phase 2: Build Dragons
After building enough towers, the human player changes the command to "build a workshop" (the workshop is the building that can produce dragons). One peasant then starts to build the workshop immediately.
Once the workshop is built, the human player changes the command and keeps asking the policy to "build dragon". The workshop then keeps producing dragons, and all the peasants return to mining resources. In this phase, some enemy units start to attack but are all fended off by the towers.
Phase 3: Find the Enemies
After building 4 dragons, the human player changes the command to "scout the map", and the dragons fly around the map to explore. Once the dragons find the enemies, the RED policy automatically sends them to attack.
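The sketch below compresses the command schedule of this replay into a single Python function, mapping a summary of the game state to the command issued at each time step. The trigger thresholds (numbers of towers and dragons) and the GameSummary fields are illustrative assumptions, not values taken from the replay or from the real MiniRTS state.

```python
from dataclasses import dataclass

@dataclass
class GameSummary:
    # Illustrative summary counters; not fields of the real MiniRTS state.
    num_guard_towers: int = 0
    num_workshops: int = 0
    num_dragons: int = 0
    enemies_found: bool = False

def tower_defense_and_dragon_rush(state: GameSummary) -> str:
    """Return the language command a commander could issue at each time step."""
    if state.num_guard_towers < 4:     # Phase 1: keep building guard towers
        return "build a guard tower"
    if state.num_workshops < 1:        # Phase 2: build the workshop first...
        return "build a workshop"
    if state.num_dragons < 4:          # ...then keep producing dragons
        return "build dragon"
    if not state.enemies_found:        # Phase 3: explore the map
        return "scout the map"
    # Once the dragons find the enemies, the RED policy attacks them
    # automatically, so no further command is strictly required here.
    return ""
```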
Full Replay: The complete replay video
In addition to the above 3 phases, the human player also gives some extra commands to instruct the RED policy (e.g., "all peasant mine"). We show the complete replay video here.