Adaptive Team Behavior Planning
using Human Coach Commands
Conditioning collective behaviors with human-understandable instructions
During its operating life, a robotic agent is required to act in real environments while satisfying rules and constraints specified by humans. The set of rules specified by the human might influence the role of the agent without changing its goal or its current task. Under similar constraints, a robotic agent in a RoboCup soccer match deals with a partially observable, unpredictable and dynamic scenario, and can benefit from precise human indications that condition its behavior before or during the match. To this end, classical planning methodologies can be enriched with temporal goals and constraints that enforce non-Markovian properties on past traces. The proposed work explores the real-time dynamic generation of policies whose possible trajectories comply with a set of PPLTL rules, introducing novel human-robot interaction modalities for the high-level control of team strategies in RoboCup SPL matches.
The human can thus coach the robots at a high level, through textual or vocal commands, conditioning the generated policies with temporal goals over a pre-defined, context-specific set of constrainable predicates.
Behavior conditioning process
To enable a team behavior to receive external conditioning, we propose a system capable of accepting textual or vocal commands from a user. Dynamic behavior generation is obtained by conditioning behaviors with constraints expressed in Pure-Past Linear Temporal Logic, encoding non-Markovian properties over the execution trace.
A PDDL domain and a base goal (e.g. "goalScored") are provided. The user can then issue commands over a set of constrainable predicates that depends on the current PDDL domain, in the form of pre-defined sentence templates, which are translated into their PLTLf equivalents. The team behavior is then adapted in real time, potentially modifying the collective strategy.
Goals and constraints are role-specific and the overall architecture ensures their synchronization through shared registries for fluents, percepts, actions and their parameters.
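As a rough illustration of this conditioning step, the sketch below shows how a coach command could be matched against a pre-defined sentence template and conjoined with the role's base goal. The template set, the "never ..." command form and the helper names are illustrative assumptions, not the actual implementation.

# Minimal sketch: translate a coach command into a PLTLf constraint and conjoin it with the base goal.
import re

# template -> PLTLf pattern ("{p}" is the constrainable predicate, arguments joined by underscores)
TEMPLATES = {
    r"at least once (?P<p>[\w ]+)": "O({p})",   # Once: the predicate must hold at some past instant
    r"never (?P<p>[\w ]+)":         "!O({p})",  # never in the past: equivalent to H(!p)
}

def translate_command(command: str) -> str:
    """Match a natural-language command against the templates and return a PLTLf constraint."""
    for pattern, pltl in TEMPLATES.items():
        match = re.fullmatch(pattern, command.strip().lower())
        if match:
            predicate = "_".join(match.group("p").split())  # "isat ball kickingposition" -> "isat_ball_kickingposition"
            return pltl.format(p=predicate)
    raise ValueError(f"No template matches command: {command!r}")

def conditioned_goal(base_goal: str, commands: list[str]) -> str:
    """Conjoin the base goal with the constraints obtained from every received command."""
    constraints = [translate_command(c) for c in commands]
    return " && ".join([base_goal, *constraints])

print(conditioned_goal("isat_ball_goaltarget", ["At least once isat ball kickingposition"]))
# -> isat_ball_goaltarget && O(isat_ball_kickingposition)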
Architecture
The system is structured modularly and is easily extendable with new user interaction modules, offering the chance to explore further human-robot interaction modalities and high-level control paradigms.
Architecture of the proposed system
The figure above shows an instance of the architecture during the execution of a multi-robot behavior for passing the ball or scoring a goal, involving policies for the "Striker" role and the "Jolly" role (whose purpose is to always make itself available to receive a pass). Each robot in the team has a role, self-assigned through a coordination algorithm that uses percepts exchanged by the robots at a lower level and ensures that the assigned roles are mutually exclusive.
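For illustration only, the following sketch shows one simple way such a mutually exclusive assignment could be computed from shared percepts; the actual coordination algorithm runs inside the robot control framework and may differ, and the utilities below are assumptions.

# Illustrative sketch of a mutually exclusive role assignment based on shared percepts.
from itertools import permutations

ROLES = ["Striker", "Jolly"]

def role_utility(role: str, percepts: dict) -> float:
    """Hypothetical utility: e.g. the Striker prefers the robot closest to the ball."""
    if role == "Striker":
        return -percepts["distance_to_ball"]
    return -percepts["distance_to_receiving_position"]

def assign_roles(team_percepts: dict[int, dict]) -> dict[int, str]:
    """Pick the one-to-one robot/role mapping with maximum total utility.
    Every robot runs the same computation on the same shared percepts,
    so all robots agree on the (mutually exclusive) assignment."""
    robots = sorted(team_percepts)
    best, best_score = None, float("-inf")
    for ordering in permutations(robots, len(ROLES)):
        score = sum(role_utility(role, team_percepts[rob]) for role, rob in zip(ROLES, ordering))
        if score > best_score:
            best, best_score = dict(zip(ordering, ROLES)), score
    return best

print(assign_roles({1: {"distance_to_ball": 1.0, "distance_to_receiving_position": 4.0},
                    2: {"distance_to_ball": 3.0, "distance_to_receiving_position": 2.0}}))
# -> {1: 'Striker', 2: 'Jolly'}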
The architecture can be divided into the following parts:
The robot control framework instances, based on the B-Human framework, running on each robot. The framework manages the execution of received actions and announces their completion. Robots communicate at a lower level to ensure synchronization of percepts and to allow role assignment using a mutually exclusive coordination algorithm.
A communication manager that wraps the communication controllers, each one statically assigned to a robot, and a policy for each role in the team. The current policy action for each role is sent to the robot assigned to that role, and policies are advanced only when the previous action has been completed. The current set of percepts updated by the robots, a set of selected fluents, and the robot actions are stored in the Environment Registries: fluents, computed from the latest percepts, are stored in the Fluent Registry; percepts, labeled with the role that sent them, are stored in the Value Registry and are retrieved during the evaluation of fluents; the Action Registry stores action templates and instances, mapping them to the available robot skills (a minimal sketch of these registries is given after this list).
A non-deterministic planning module with behavior conditioning. Textual or vocal commands are translated into PLTLf rules by matching them against pre-defined templates. Several constraints over a pre-defined set of constrainable predicates can be specified for each role, and the final temporal goal is obtained as their conjunction with the original goal. A new plan is generated each time a new constraint is received.
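The following minimal sketch illustrates the role of the three registries described above; class names, fields and the example skill names are assumptions, not the actual implementation.

# Minimal sketch of the Environment Registries (Value, Fluent and Action registries).
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ValueRegistry:
    """Latest percepts, labeled with the role that sent them."""
    values: dict[str, Any] = field(default_factory=dict)

    def update(self, role: str, name: str, value: Any) -> None:
        self.values[f"{role}/{name}"] = value

    def get(self, role: str, name: str) -> Any:
        return self.values[f"{role}/{name}"]

@dataclass
class FluentRegistry:
    """Fluents computed on demand from the latest percepts in the Value Registry."""
    fluents: dict[str, Callable[[ValueRegistry], bool]] = field(default_factory=dict)

    def evaluate(self, name: str, values: ValueRegistry) -> bool:
        return self.fluents[name](values)

@dataclass
class ActionRegistry:
    """Maps planner action names to robot skills (hypothetical skill names below)."""
    skills: dict[str, str] = field(default_factory=dict)

    def skill_for(self, action: str) -> str:
        return self.skills[action]

# Example wiring: a fluent checking whether an obstacle reported by the Striker blocks the goal.
values = ValueRegistry()
values.update("Striker", "obstacle_blocking_goal", True)
fluents = FluentRegistry({"fluentisstrikerobstacleblockinggoal":
                          lambda v: v.get("Striker", "obstacle_blocking_goal")})
actions = ActionRegistry({"carryball": "DribbleToTarget", "kickball": "KickToTarget"})
print(fluents.evaluate("fluentisstrikerobstacleblockinggoal", values))  # -> True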
Examples
Single-Agent example
Our PDDL domain for a naive striker behavior features three actions: moverobot, kickball and carryball.
In our case, kickball and carryball have the same post-conditions, but they are linked to different low-level skills on the robot (a kicking and a dribbling skill). The goal for the striker is to have the ball at the goaltarget location. To condition the policy, at least one constrainable predicate is needed: in our case, the only constrainable predicate is isat, modeling the position of the ball or of the robot. The user is allowed to specify a role-specific constraint. Notice that all domain objects (such as locations) have to be grounded using the Environment Registries. Robots perform policy actions by executing the corresponding low-level atomic behaviors, implemented as options (the common approach used in the SPL).
In a first example, the goal isat ball goaltarget is left unconstrained. The result is a simple plan requiring the robot to reach the ball (moverobot) and then carry it to the goal (carryball).
Formal scenario description (PDDL domain/problem and goal)
PDDL domain
(define (domain robocupdeterministic)
(:requirements :strips :typing)
(:types movable location)
(:predicates
(isrobot ?mov - movable)
(isball ?mov - movable)
(isat ?mov - movable ?loc - location)
(goalscored)
)
(:action moverobot
:parameters (?rob - movable ?from - location ?to - location)
:precondition (and (isrobot ?rob) (isat ?rob ?from) (not (isat ?rob ?to)))
:effect
(and (not (isat ?rob ?from)) (isat ?rob ?to))
)
(:action kickball
:parameters (?rob - movable ?b - movable ?from - location ?to - location)
:precondition (and (isball ?b) (isrobot ?rob) (isat ?rob ?from) (not (isat ?rob ?to)) (isat ?b ?from) (not (isat ?b ?to)))
:effect
(and (isat ?rob ?from) (not (isat ?b ?from)) (isat ?b ?to))
)
(:action carryball
:parameters (?rob - movable ?b - movable ?from - location ?to - location)
:precondition (and (isball ?b) (isrobot ?rob) (isat ?rob ?from) (not (isat ?rob ?to)) (isat ?b ?from) (not (isat ?b ?to)))
:effect
(and (not (isat ?rob ?from)) (isat ?rob ?to) (not (isat ?b ?from)) (isat ?b ?to))
)
)
PDDL problem
(define (problem simplestriker)
(:domain robocupdeterministic)
(:objects
robot1 ball - movable
strikercurrentposition ballcurrentposition kickingposition goaltarget - location
)
(:init
(isat robot1 strikercurrentposition)
(isat ball ballcurrentposition)
(isrobot robot1)
(isball ball)
)
(:goal <see below>)
)
Notice that this goal is overwritten by the PLTLf goal during the compilation process.
PLTLf goal
isat_ball_goaltarget
which is the compilable equivalent of the PDDL goal (isat ball goaltarget).
The robot can be forced to carry the ball through the kickingposition location simply by issuing the additional command "at least once isat ball kickingposition", which is translated into the PLTLf constraint O(isat ball kickingposition) and added as a conjunct to the original goal. The newly generated policy is shown in the figure to the right: an additional step forces the robot to pass through the waypoint.
Formal scenario description (PDDL domain/problem and goal)
PDDL domain
(define (domain robocupdeterministic)
(:requirements :strips :typing)
(:types movable location)
(:predicates
(isrobot ?mov - movable)
(isball ?mov - movable)
(isat ?mov - movable ?loc - location)
(goalscored)
)
(:action moverobot
:parameters (?rob - movable ?from - location ?to - location)
:precondition (and (isrobot ?rob) (isat ?rob ?from) (not (isat ?rob ?to)))
:effect
(and (not (isat ?rob ?from)) (isat ?rob ?to))
)
(:action kickball
:parameters (?rob - movable ?b - movable ?from - location ?to - location)
:precondition (and (isball ?b) (isrobot ?rob) (isat ?rob ?from) (not (isat ?rob ?to)) (isat ?b ?from) (not (isat ?b ?to)))
:effect
(and (isat ?rob ?from) (not (isat ?b ?from)) (isat ?b ?to))
)
(:action carryball
:parameters (?rob - movable ?b - movable ?from - location ?to - location)
:precondition (and (isball ?b) (isrobot ?rob) (isat ?rob ?from) (not (isat ?rob ?to)) (isat ?b ?from) (not (isat ?b ?to)))
:effect
(and (not (isat ?rob ?from)) (isat ?rob ?to) (not (isat ?b ?from)) (isat ?b ?to))
)
)
PDDL problem
(define (problem simplestriker)
(:domain robocupdeterministic)
(:objects
robot1 ball - movable
strikercurrentposition ballcurrentposition kickingposition goaltarget - location
)
(:init
(isat robot1 strikercurrentposition)
(isat ball ballcurrentposition)
(isrobot robot1)
(isball ball)
)
(:goal <see below>)
)
Notice that this goal is overwritten by the PLTLf goal during the compilation process.
PLTLf goal
isat_ball_goaltarget && O(isat_ball_kickingposition)
where the goal is the same as in the previous example, but with the additional constraint O(isat_ball_kickingposition) forcing the robot to first bring the ball to the waypoint called kickingposition before carrying it to the goal.
Given a predicate modeling high battery consumption in the post-conditions of the carryball action, a new constraint H(¬highBatteryConsumption) can be added, preventing the robot from performing battery-consuming actions. In our case, the carryball action is considered battery-consuming (the robot constantly needs to keep itself aligned to the ball and therefore has to move slowly), so the only ball-movement action that can be performed is kickball, as shown in the figure to the right.
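For reference, the past operators used in these constraints have the standard PPLTL reading, evaluated at the current (last) instant of the trace; in particular H(¬φ) is equivalent to ¬O(φ), which is why the compiled goal below is written with !O(...):

O(φ) holds at instant i  ⟺  φ holds at some instant j ≤ i   ("Once")
H(φ) holds at instant i  ⟺  φ holds at every instant j ≤ i   ("Historically")
H(¬φ) ≡ ¬O(φ)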
Formal scenario description (PDDL domain/problem and goal)
PDDL domain
(define (domain robocupdeterministic)
(:requirements :strips :typing)
(:types movable location)
(:predicates
(isrobot ?mov - movable)
(isball ?mov - movable)
(isat ?mov - movable ?loc - location)
(goalscored)
(highbatteryconsumption)
)
(:action moverobot
:parameters (?rob - movable ?from - location ?to - location)
:precondition (and (isrobot ?rob) (isat ?rob ?from) (not (isat ?rob ?to)))
:effect
(and (not (isat ?rob ?from)) (isat ?rob ?to))
)
(:action kickball
:parameters (?rob - movable ?b - movable ?from - location ?to - location)
:precondition (and (isball ?b) (isrobot ?rob) (isat ?rob ?from) (not (isat ?rob ?to)) (isat ?b ?from) (not (isat ?b ?to)))
:effect
(and (isat ?rob ?from) (not (isat ?b ?from)) (isat ?b ?to))
)
(:action carryball
:parameters (?rob - movable ?b - movable ?from - location ?to - location)
:precondition (and (isball ?b) (isrobot ?rob) (isat ?rob ?from) (not (isat ?rob ?to)) (isat ?b ?from) (not (isat ?b ?to)))
:effect
(and (not (isat ?rob ?from)) (isat ?rob ?to) (not (isat ?b ?from)) (isat ?b ?to) (highbatteryconsumption))
)
)
PDDL problem
(define (problem simplestriker)
(:domain robocupdeterministic)
(:objects
robot1 ball - movable
strikercurrentposition ballcurrentposition kickingposition goaltarget - location
)
(:init
(isat robot1 strikercurrentposition)
(isat ball ballcurrentposition)
(isrobot robot1)
(isball ball)
)
(:goal <see below>)
)
Notice that this goal is overwritten by the PLTLf goal during the compilation process.
PLTLf goal
isat_ball_goaltarget && !O(highBatteryConsumption)
where the base goal is the same as in the previous examples, but with the additional constraint !O(highBatteryConsumption) ("not Once highBatteryConsumption"), which prevents the execution of any action that has the highBatteryConsumption predicate in its post-conditions (in our simplified example, the carryball action, which requires the robot to constantly realign to the ball and is therefore more power-consuming).
Multi-Agent example
In our multi-agent scenario, the policies for the "Striker" role and for the "Jolly" role (whose purpose is to always make itself available for a pass) are obtained from non-deterministic domains.
The Jolly turns towards the striker if it is already reachable for a pass, or moves to its waiting position otherwise. The additional "check" states are required for synchronization with the other robot and simply leave the robot idle, waiting to receive updated percepts from the Striker role.
Formal scenario description (PDDL domain/problem and goal)
PDDL domain
(define (domain jollyrobocupfond)
(:requirements :strips :non-deterministic)
(:predicates
(fluentisstrikerobstacleblockinggoal)
(fluentisjollyinposition)
(fluentisjollyalignedtostriker)
(jollyinposition)
(jollyalignedtostriker)
(jollyavailable)
(jollypositionok)
(jollyrotationok)
(jollyready)
)
(:action checkobstacleposition
:parameters ()
:precondition
(jollyavailable)
:effect
(oneof
(fluentisstrikerobstacleblockinggoal)
(when (not (fluentisstrikerobstacleblockinggoal)) (jollyinposition))
)
)
(:action movetoreceivingposition
:parameters ()
:precondition
(and
(fluentisstrikerobstacleblockinggoal)
(not (jollyinposition))
)
:effect
(jollyinposition)
)
(:action turntostriker
:parameters ()
:precondition
(and
(jollyinposition)
(or
(not(jollyalignedtostriker))
(not(fluentisstrikerobstacleblockinggoal))
)
)
:effect
(and
(jollyalignedtostriker)
)
)
(:action checkjollyposition
:parameters ()
:precondition
(not (jollypositionok))
:effect
(and
(when (fluentisjollyinposition) (jollypositionok))
(when (jollyinposition) (jollypositionok))
)
)
(:action checkjollyrotation
:parameters ()
:precondition
(not (jollyrotationok))
:effect
(and
(when (fluentisjollyalignedtostriker) (jollyrotationok))
(when (jollyalignedtostriker) (jollyrotationok))
)
)
(:action checkjollyready
:parameters ()
:precondition
(and
(jollypositionok)
(jollyrotationok)
)
:effect
(jollyready)
)
)
As we can see, this example uses predicates starting with the "fluent" keyword. The policy wrapper automatically treats all such predicates as fluents from the environment and looks them up in the Fluent Registry. Whenever several branches leave the same policy node, the wrapper evaluates the fluents labeling each branch (a fluent may be constant or functional; in the latter case the evaluation is an actual depth-first visit of the corresponding expression tree) and follows the branch whose fluents are satisfied. Synchronization between robots is ensured by these fluents. A minimal sketch of this branch selection is given below.
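The following sketch illustrates the branch-selection step; the data layout and names are assumptions, not the actual policy wrapper.

# Minimal sketch: select the next policy action by evaluating the fluents labeling each branch.
from typing import Callable

FluentTable = dict[str, Callable[[], bool]]

def select_branch(branches: list[dict], fluent_registry: FluentTable) -> str:
    """Each branch is {"fluents": {name: expected truth value}, "action": skill name}.
    Returns the action of the first branch whose fluents all evaluate as expected."""
    for branch in branches:
        if all(fluent_registry[name]() == expected
               for name, expected in branch["fluents"].items()):
            return branch["action"]
    return "idle"  # e.g. a "check" state: wait for updated percepts

# Example: the Jolly's first decision point in the domain above (fluent values are illustrative).
registry: FluentTable = {"fluentisstrikerobstacleblockinggoal": lambda: True}
branches = [
    {"fluents": {"fluentisstrikerobstacleblockinggoal": True},  "action": "movetoreceivingposition"},
    {"fluents": {"fluentisstrikerobstacleblockinggoal": False}, "action": "turntostriker"},
]
print(select_branch(branches, registry))  # -> movetoreceivingposition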
PDDL problem
(define (problem jollyfond)
(:domain jollyrobocupfond)
(:init
(jollyavailable)
)
(:goal
(and
(jollyready)
)
)
)
Notice that this goal is overwritten by the PLTLf goal during the compilation process.
PLTLf goal
In this example no conditioning is used, as it is meant to show the expressive power of non-deterministic planning in multi-robot scenarios. The goal is simply the PLTLf transcription of the PDDL goal above.
The striker initially reaches the ball, and then, depending on the current situation on the field, it chooses the best policy branch:
If the goal is free (even if a jolly is available for a pass), the striker attempts to kick towards the goal, as in the second and third branches from the left.
In the leftmost branch, chosen if the goal is not free and no jolly is available, the striker tries to dribble past the opponent and bring the ball to the goal, carrying it the whole time (to protect it from the opponent).
In the rightmost branches, chosen if the goal is blocked and a jolly is available, the striker waits for the jolly to get in position and then passes the ball. Notice that there is a dedicated branch for the case in which the jolly is not yet in position, in which the striker simply waits.
Formal scenario description (PDDL domain/problem and goal)
PDDL domain
(define (domain striker-robocup-fond)
(:requirements :strips :non-deterministic)
(:predicates
(fluent-obstacle-blocking-goal)
(fluent-jolly-available)
(fluent-jolly-in-position)
(striker-should-dribble-opponent)
(striker-attempting-dribble)
(goal-scored)
(ball-passed)
(striker-has-ball)
(striker-can-kick)
)
(:action move-to-ball
:parameters ()
:precondition
(and
(not (striker-has-ball))
)
:effect
(oneof
(and (not (fluent-obstacle-blocking-goal)) (not (fluent-jolly-available)) (not(fluent-jolly-in-position)) (striker-has-ball))
(and (not (fluent-obstacle-blocking-goal)) (fluent-jolly-available) (striker-has-ball))
(and (fluent-obstacle-blocking-goal) (fluent-jolly-available) (not (fluent-jolly-in-position)) (striker-has-ball))
(and (fluent-obstacle-blocking-goal) (fluent-jolly-available) (fluent-jolly-in-position) (striker-has-ball))
(and (fluent-obstacle-blocking-goal) (not(fluent-jolly-available)) (not(fluent-jolly-in-position)) (striker-has-ball))
)
)
(:action carry-ball-to-kick
:parameters ()
:precondition
(and
(not (fluent-obstacle-blocking-goal))
(striker-has-ball)
)
:effect
(striker-can-kick)
)
(:action kick-to-goal
:parameters ()
:precondition
(and
(not (fluent-obstacle-blocking-goal))
(striker-can-kick)
)
:effect
(and
(goal-scored)
)
)
(:action wait-for-jolly
:parameters ()
:precondition
(and
(fluent-obstacle-blocking-goal)
(fluent-jolly-available)
(not(fluent-jolly-in-position))
(not(striker-can-kick))
)
:effect
(oneof
(and (fluent-jolly-in-position) (fluent-jolly-available) (fluent-obstacle-blocking-goal))
(and (not (fluent-jolly-in-position)) (fluent-jolly-available) (fluent-obstacle-blocking-goal))
)
)
(:action pass-ball-to-jolly
:parameters ()
:precondition
(and
(fluent-obstacle-blocking-goal)
(fluent-jolly-in-position)
(fluent-jolly-available)
)
:effect
(and (ball-passed))
)
(:action dribble-opponent
:parameters ()
:precondition
(and
(fluent-obstacle-blocking-goal)
(not(fluent-jolly-available))
)
:effect
(striker-attempting-dribble)
)
)
As in the previous example, predicates starting with the "fluent" keyword are treated by the policy wrapper as fluents from the environment and looked up in the Fluent Registry; whenever several branches leave the same policy node, the wrapper evaluates the fluents labeling each branch and follows the one whose fluents are satisfied, and synchronization between robots is again ensured by these fluents.
PDDL problem
(define (problem striker-fond)
(:domain striker-robocup-fond)
(:init )
(:goal (or (ball-passed) (goal-scored) (striker-attempting-dribble)))
)
PLTLf goal
In this example no conditioning is used, as it is meant to show the expressive power of non-deterministic planning in multi-robot scenarios. The goal is simply the PLTLf transcription of the PDDL goal above.
Notice that once the Jolly receives the ball, it automatically becomes the Striker, according to the role-assignment algorithm running at a lower level on the robots; the policies are then reset and reassigned to the respective robots, so that the scenario starts over with different fluent values.
Adversarial example
Results
The evaluation of our proposed approach includes both qualitative and quantitative results.
To qualitatively evaluate the benefit of the conditioning in a multi-agent scenario, the field was modeled such that the robot moves the ball between waypoints located at the initial robot position, at the goal target (which the ball has to reach as the final goal), and at regularly arranged positions along the field. We ran 10 test matches between the guided team and an unconditioned team used as a baseline.
The baseline behavior consisted in moving the ball between waypoints by kicking it, leaving it slightly unprotected during the approach maneuvers. The human coach was able to notice this vulnerability and exploit it by forcing the striker robot to always "dribble" the ball along the right flank before kicking towards the goal. The ball is thus moved more slowly but also more protectively, allowing the robot to exploit the slower opponent approach times and occasionally steal the ball from the opponent. The results are shown in the table below.
In order to evaluate planning time over an increasing planning depth, a simple simulated RoboCup environment was used, with a single robot starting from one side of the soccer field and tasked to bring the ball to the opposite side. Conditioning can be imposed on the whole team, but to evaluate timing we only analyze the planning performance of the Striker, which is the most active role in our architecture.
The planning domain is modeled in PDDL as a square grid of waypoints on the field, and the robot is only allowed to move the ball between adjacent waypoints. The goal waypoint is placed on the opposite side of the field, so that increasing the number of grid cells per side increases the length of the shortest path to the goal, which can thus be used to control the minimal planning depth of a successful plan (a sketch of this setup is given below).
The plots below show how significantly the performance can decrease with a different representation of the environment and how the external guidance can impact the performance of the planner. We evaluated the approach using a simplified adjacency grid (on the left), which does not consider diagonally adjacent waypoints, and a complete one (on the right), which considers the whole neighborhood. In both scenarios, planning time was evaluated with and without an additional constraint forcing the robot to pass through a waypoint located roughly in the middle of the shortest path, expressed using the "O(is_at ...)" constraint. The grid size (and therefore the expected minimal planning depth) varies between 4 and 35 in all cases.
In the simplified representation, the conditioning does not significantly affect the planning time. With the complete waypoint-adjacency representation, instead, the planning time increases by around 20 times and the improvement given by the conditioning becomes relevant, as can be seen in the plot: the conditioning reduces the search space and improves the planner's performance.
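As a rough illustration of this setup (the adjacent predicate, the domain name robocupgrid and the waypoint naming are assumptions, not the actual benchmark code), the grid problems could be generated as follows:

# Sketch: generate an n x n grid-of-waypoints PDDL problem, with adjacency facts for the
# 4-neighborhood (simplified grid) or the 8-neighborhood (complete grid). The shortest ball
# path from one corner to the opposite one grows with n and controls the minimal planning depth.
def grid_problem(n: int, diagonals: bool) -> str:
    locs = [f"wp_{r}_{c}" for r in range(n) for c in range(n)]
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if diagonals:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    adjacency = []
    for r in range(n):
        for c in range(n):
            for dr, dc in steps:
                if 0 <= r + dr < n and 0 <= c + dc < n:
                    adjacency.append(f"(adjacent wp_{r}_{c} wp_{r+dr}_{c+dc})")
    init = ["(isat ball wp_0_0)", "(isat robot1 wp_0_0)"] + adjacency
    goal = f"(isat ball wp_{n-1}_{n-1})"
    return (f"(define (problem grid{n})\n (:domain robocupgrid)\n"
            f" (:objects robot1 ball - movable {' '.join(locs)} - location)\n"
            f" (:init {' '.join(init)})\n (:goal {goal})\n)")

print(grid_problem(4, diagonals=False))  # 4x4 simplified grid; use diagonals=True for the complete one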
Citation
If you use our code for any academic work, please cite the corresponding paper.
Musumeci, E., Suriani, V., Antonioni, E., Nardi, D., Bloisi, D.D. (2023). Adaptive Team Behavior Planning Using Human Coach Commands. In: Eguchi, A., Lau, N., Paetzel-Prüsmann, M., Wanichanon, T. (eds) RoboCup 2022: Robot World Cup XXV. Lecture Notes in Computer Science, vol 13561. Springer, Cham. https://doi.org/10.1007/978-3-031-28469-4_10