Predicting Throw Timings
Individual Research Topic, Connor Yates
Multi-robot interaction will be a complex part of any fully autonomous team deployed in the field in the future. Additionally, tasks in remote locations (search and rescue, extraterrestrial exploration) or with a large number of agents (autonomous traffic routing, air traffic control) will not permit explicit communication between all teammates. Under constraints like these, agents interacting as a team must be able to account for the actions of their teammates with little to no explicit communication.
There are two subtasks an independent robot within a team must accomplish before this can happen. First, the robot must determine the intent of its teammates. This could be through direct observation, such as watching how a teammate moves across a forest, or through sporadic communications in which a teammate provides brief status updates. Once the intent of the other agent is known, the robot must incorporate this knowledge into its own decision-making process.
In a competitive game such as catch, a robot that can infer the intent of its opponent can proactively account for the opponent's actions. This provides a competitive edge in high-speed domains, where planning around the opponent's intended actions lets the robot compensate for the fast actions inherent in games like catch.
The goal of this research is to extend dynamic, adaptive opponent modeling so that opponent models can be used for both human and robotic opponents. A major difficulty lies in the inexpressiveness of a robotic opponent: purely observational techniques for predicting the immediate intent of the opponent will not work on inexpressive robots. Thus, I propose directly modeling the opponent and creating a general opponent model that can be fine-tuned mid-match to capture the opponent's intent and predict its actions.
Adaptive decision-making is a critical task for intelligent agents in teaming or competitive tasks. From competitive games like poker [4] and soccer [5] or interactive team tasks like human robot interaction [3] and multiagent negotiations [1], agents interacting with external intelligent actors can benefit by modeling and predicting the external actors' behaviors.
Explicitly modeling external agents has been done in these works through Markov decision processes (MDPs) [3], reinforcement learning agents [2,4], and statistical methods such as Gaussian processes [1] or explicit, domain-specific models [5]. These explicit modeling methods have various benefits and drawbacks. In general, they focus either on rapid adaptation to a changing environment [4,5] or on operating with unstructured interaction protocols [1,3].
The difficulty of robotic catch is that both operating regimes are required to effectively play and win against an opponent. There is no set way in which an opponent will throw a ball, unlike the fixed structure of playing cards in poker, and the behaviors may change wildly during the game, unlike many teaming exercises.
This research proposes joining the strengths of these two classes of opponent modeling into a single paradigm. By using an adaptive model, such as an LSTM (see [4]) or an MDP, in conjunction with predicting within the semi-structured interaction provided by the game of catch, an agent should be able to quickly predict what action the opponent will take next.
In order to predict the amount of time until the throw, a state machine was implemented to gauge the progression of a right-handed, overhand throw. As the human throws the ball over their head, the relative position of their hand and elbow changes in a reliable fashion. The relative transformation between the hand and elbow is gathered using skeleton tracking, a built-in feature of the Xbox Kinect (see Figure 1 for a visualization).
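The relative hand-elbow transformation above can be sketched as a simple difference of tracked joint positions. This is a minimal illustration, assuming the skeleton tracker reports each joint as an (x, y, z) position in the camera frame; the function name and sample coordinates are hypothetical.

```python
import numpy as np

def hand_elbow_offset(hand_xyz, elbow_xyz):
    """Relative translation from the elbow to the hand, in the camera frame.

    hand_xyz, elbow_xyz: length-3 sequences (meters) from skeleton tracking.
    """
    return np.asarray(hand_xyz, dtype=float) - np.asarray(elbow_xyz, dtype=float)

# Example: hand raised above and slightly in front of the elbow.
offset = hand_elbow_offset([0.10, 0.55, 2.00], [0.12, 0.30, 2.10])
```

Tracking this offset over time, rather than the absolute joint positions, makes the state machine insensitive to where the thrower stands relative to the camera.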
An overhand throw was categorized into five main states: Wind Up, Last Extension, Throwing, Released, and None.
These states were defined at transitions of the values of the relative transformation. By measuring an average ending hand-elbow transformation for the arm, the distance to that ending position can be calculated by looking at the distance between the current hand-elbow transformation and the goal transformation.
As each relative transformation is calculated, the pose is classified into one of the five main states. The four throw states are the focus; if the current pose does not fit any of the four, it is automatically classified into the None state.
Figure 1: A visualization of human skeleton tracking as the human throws the ball. The right hand is elevated, and the orientation transformation markers are placed on the person's right hand and elbow. Note that the markers are labeled left as they are left in the observing camera's frame.
Having refined the goal to predicting only when the throw will occur, the evaluation metrics examine the prediction error. Predicting a throw later than it occurs is penalized more heavily, since this means the robot is predicting a throw after it has already happened. In this game of catch, it is better for the robot to react early than to react late.
I will use a prediction error which squares overshooting predictions and linearly considers undershooting predictions. With $\Delta t_i = t_{actual,i} - t_{predicted,i}$, the error for throw $i$ is
$$
e_i = \begin{cases} \Delta t_i & \Delta t_i \geq 0 \\ (\Delta t_i)^2 & \Delta t_i < 0. \end{cases}
$$
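The asymmetric error above is straightforward to implement; this sketch follows the definition directly, with the function name being my own.

```python
def throw_error(t_actual, t_predicted):
    """Asymmetric prediction error for a single throw.

    Predictions later than the actual throw (negative delta) are squared;
    early predictions count linearly.
    """
    dt = t_actual - t_predicted
    return dt if dt >= 0 else dt ** 2

# Predicting 0.5 s early gives an error of 0.5;
# predicting 0.5 s late gives (-0.5)**2 = 0.25.
```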
This research will look at how precise timing predictions for human actions, in this case throwing the ball, can be performed in timing-critical domains.
To predict when the human will throw the ball, I recorded several videos with the Kinect depth camera of myself throwing the ball toward the camera (as if it were the robot). Figure 2 below shows frames from these videos. I will use this data as the input to the time prediction model, either as-is in point cloud form or reduced to a lower-dimensional representation, such as skeleton tracking.
Figure 2: Sample frame from point cloud video captured from the Kinect. From this, a skeleton model of the human can be extracted and used to classify several basic situations based on joint geometry. By timing the transitions between rough states until a throw occurs, the agent can determine where in the throwing process the human is, and create a rough time estimate until they throw the ball.
To test the time-guessing state machine, I first created an estimate of the ending arm position of a right-handed overhand throw. Thirty ending positions were averaged into a single position, which was used as the reference end-of-throw hand-elbow transformation.
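Building the reference transformation is a per-axis mean over the recorded ending offsets. A minimal sketch, assuming the offsets are stacked into a NumPy array; the three sample values here are hypothetical stand-ins for the 30 recorded positions.

```python
import numpy as np

# Hypothetical end-of-throw hand-elbow offsets (meters); in the experiment,
# 30 such samples were collected and averaged.
samples = np.array([
    [0.003, 0.096, 0.298],
    [0.002, 0.093, 0.297],
    [0.004, 0.098, 0.299],
])

goal_offset = samples.mean(axis=0)   # reference hand-elbow transformation
spread = samples.std(axis=0)         # per-axis standard deviation
```

The per-axis standard deviation indicates how repeatable each component of the ending pose is, and therefore how tightly the state thresholds can be set.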
A single set of throws was used to create time estimates till the throw for each state. Then 3 additional sets of throws was calculated with this model, to test the generalizable of the state timings. The error function presented in Section~\ref{sec:eval} is used as the error metric in the results.
Averaging the 30 arm positions results in an ending transformation pose of
x: 0.00292808 y: 0.09576139 z: 0.29780168
with a standard deviation of
x: 0.00140708 y: 0.00271313 z: 0.00090179
The most important translation component here was the z component, followed by the y component.
Using primarily these components, state transitions were defined by the difference between the current relative transformation (T_t) and the goal transformation (T_g), i.e., T_g - T_t. The threshold values are shown in Table 1.
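The state classification can be sketched as a cascade of thresholds on that difference. The numeric bounds below are hypothetical placeholders, since Table 1 is not reproduced here; the keying on the z (and y) components follows the description above.

```python
import numpy as np

def classify_state(T_t, T_g):
    """Classify the current pose into one of the five throw states.

    T_t, T_g: length-3 (x, y, z) hand-elbow translations, current and goal.
    Threshold values are illustrative stand-ins for Table 1.
    """
    dx, dy, dz = np.asarray(T_g, dtype=float) - np.asarray(T_t, dtype=float)
    if abs(dz) < 0.02 and abs(dy) < 0.02:
        return "Released"        # hand has reached the reference ending pose
    if abs(dz) < 0.05:
        return "Throwing"
    if abs(dz) < 0.15:
        return "Last Extension"
    if abs(dz) < 0.40:
        return "Wind Up"
    return "None"                # pose does not fit any of the four throw states
```

Ordering the checks from tightest to loosest bound means each frame falls through to the first state whose tolerance it satisfies.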
Based on the initial set of throws, the interval from the start of the Wind Up state to the throw took around 3 seconds. The other states took around 1 second, 0.5 seconds, and 0 seconds respectively. (The Released state is when the ball is thrown, so naturally it takes 0 seconds.)
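Those per-state timings amount to a lookup table from the classified state to the estimated time remaining. A minimal sketch, with the dictionary and function names my own:

```python
# Rough seconds remaining until release, measured on the initial set of throws.
TIME_TO_THROW = {
    "Wind Up":        3.0,
    "Last Extension": 1.0,
    "Throwing":       0.5,
    "Released":       0.0,
}

def predict_time_to_throw(state):
    """Return the estimated seconds until the ball leaves the hand,
    or None if the pose did not match any throw state."""
    return TIME_TO_THROW.get(state)
```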
These time predictions were applied to the three test sets of throws. The state transitions were classified, and the time to throw was calculated after the throw occurred. The relative differences and the errors from the Evaluation Methods section are shown in Table 2.
Given the errors detected in Trials 2 and 3, the state prediction times are only accurate at the initial guess. This is most likely due to looser tolerances as the throw progresses, owing to the small magnitude of the time remaining in the Last Extension and Throwing states. Interestingly, the arm position calculations did not catch the Throwing state in the last two trials. However, this state may be unnecessary, as the errors in the Wind Up time prediction are relatively low and show promise that the final throw may be predicted from the initial windup.
This somewhat satisfies the initial research hypothesis: state-transition based methods show promise in predicting the timing of a human action. However, identifying an adequate timeout still needs some work, though I observed that the state identification in the initial windup was fairly accurate.
In the future, this work can be improved by incorporating the velocity of the human arm into the time prediction, which would enable a wider range of throws to be classified using state-transition methods. Additionally, new throws or other human actions could be mapped into state-transition graphs, broadening this application to many complex, high-speed human actions. Again, the key takeaway is that identifying the start of the action yields the most accurate time predictions.