Survivable Hyper-Redundant Robotic Arm with Bayesian Policy Morphing
Sayyed Jaffar Ali Raza, Apan Dastider and Mingjie Lin
Autonomous Computing Lab, University of Central Florida
This paper has been accepted for publication in the proceedings of the International Conference on Automation Science and Engineering (CASE) 2020 and for oral presentation at the conference.
To address the prevailing inability of present-day robotic agents to "morph" an existing control policy into an incrementally new policy that copes with unpredictable bodily damage, even though this kind of "morphing", i.e., autonomously synthesizing new actions and behaviors, has evolved in animals through billions of years of Darwinian selection.
To outline a process for turning an existing motion policy into a functioning motion policy that accommodates the absence or loss of mobility relative to the robot's original body, and to execute the policy-determined actions with negligible lag so as to meet the requirements of real-time operation.
To integrate a Bayesian learning framework with deep reinforcement learning and investigate how a Bayesian estimation framework helps an agent modify its own policy by exploiting its base policy, so that it can act appropriately under previously unseen working conditions.
Our proposed Bayesian Policy Morphing (BPM) algorithm integrates the previously learned motion policy with new state observations so that an updated policy, compatible with the new state, can be synthesized incrementally and probabilistically.
Our BPM algorithm has been applied to tackle unanticipated mechanical damage in an 8-DoF robotic manipulator, considering three types of mechanical malfunction: 1) a completely frozen joint, 2) a joint always offset by 30 degrees, and 3) a joint with stochastic rotational inaccuracy.
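For concreteness, the following sketch (Python/NumPy) shows one way these three fault modes could be injected into the commanded joint angles of an 8-DoF arm during simulation; the function name apply_fault, the frozen angle, and the noise level are illustrative assumptions rather than the exact fault model used in our experiments.

import numpy as np

# Illustrative fault-injection sketch (not the exact experimental setup):
# the three malfunction modes applied to a commanded 8-joint angle vector.
FROZEN, OFFSET, NOISY = "frozen", "offset", "noisy"

def apply_fault(commanded, joint_idx, mode,
                frozen_angle=0.0,
                offset=np.deg2rad(30.0),
                noise_std=np.deg2rad(5.0)):
    """Return the joint angles the damaged arm actually realizes."""
    realized = np.array(commanded, dtype=float)
    if mode == FROZEN:
        # Malfunction 1: the joint is completely frozen at a fixed angle.
        realized[joint_idx] = frozen_angle
    elif mode == OFFSET:
        # Malfunction 2: the joint always lands 30 degrees off the command.
        realized[joint_idx] += offset
    elif mode == NOISY:
        # Malfunction 3: the joint rotates with stochastic inaccuracy.
        realized[joint_idx] += np.random.normal(0.0, noise_std)
    return realized

# Example: joint 3 of an 8-DoF arm suffers the constant 30-degree offset.
print(apply_fault(np.zeros(8), joint_idx=3, mode=OFFSET))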
After being inflicted with bodily damage, any animal can autonomously synthesize compensatory behaviors, a critical trait evolved through billions of years. In contrast, the vast majority of today’s robotic agents cannot autonomously “morph” their original control policies into incrementally new forms in response to sudden mechanical malfunctions, except by resorting to limited, predefined contingency plans.
In this paper, we aim at developing a hyper-redundant robotic arm that can recover from random mechanical failures autonomously, hence being survivable. To this end, we formulate the framework of Bayesian Policy Morphing (BPM) that enables a robotic entity to self-modify its motion policy after the diminution of its maneuvering dimensionality. Our key idea is to infuse formal Bayesian learning into state-of-the-art deep reinforcement learning for continuous control. Although the idea of Bayesian reinforcement learning is not new, our BPM methodology differs significantly in that, instead of focusing on more accurately estimating the value function or policy gradient, our approach exploits Bayesian learning to judiciously guide policy search in the direction biased by prior experience, thereby significantly improving learning efficiency.
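Although the exact BPM update is developed later in the paper, the following minimal sketch illustrates the general idea of a policy-gradient step biased toward directions consistent with prior (pre-damage) experience; the blending weight beta, the function name biased_policy_step, and the specific linear blend are our illustrative assumptions, not the precise BPM equations.

import numpy as np

# Minimal sketch of a prior-biased policy-gradient step (illustrative only).
def biased_policy_step(theta, grad_new, grad_prior, lr=1e-3, beta=0.7):
    """Blend the fresh gradient with the prior-experience direction.

    theta      : current policy parameters
    grad_new   : gradient estimated from post-damage rollouts
    grad_prior : aggregate gradient direction retained from the base policy
    beta       : bias weight toward prior experience (assumed fixed here;
                 BPM instead derives such a bias from the critic network)
    """
    direction = beta * grad_prior + (1.0 - beta) * grad_new
    return theta + lr * direction

# Example usage with placeholder gradients.
theta = np.zeros(16)
theta = biased_policy_step(theta, np.random.randn(16), np.random.randn(16))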
With model-based simulations, we have demonstrated that our BPM algorithm enables our robotic arm to adapt to its damage in almost real time without requiring self-diagnosis or pre-specified contingency plans. To further solidify our research results, we have programmed an eight-joint robotic arm with our BPM algorithm while intentionally disabling one, two, three, and four of its joints with different damage types: 1) an unresponsive joint, 2) a constant offset angle, and 3) random angular imprecision. Our results have shown that, even when this robotic arm lost half of its joints to various malfunction modes, it could still successfully maintain its functionality, accurately locating and grasping a given target object.
First, our robotic agent constructs a base motion policy with state-of-the-art deep RL methods, assuming no faulty joints or structural malfunctions. After learning the base policy, we introduced the aforementioned three types of mobility and bodily impairments to test our proposed method and verify its performance and motion planning on the assigned task of reaching a goal position to pick up an object. The actions are the joint angle values, and a custom reward function is computed by approximating the difference between the present position and the goal position.
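As a concrete illustration of this reward, the sketch below returns the negative Euclidean distance between the arm's present position and the goal position; the forward-kinematics callable and the interpretation of "present position" as the end-effector position are assumptions made for illustration.

import numpy as np

def reward(joint_angles, goal_xyz, forward_kinematics):
    """Dense reward: negative distance from the present position to the goal.

    forward_kinematics is assumed to map the joint angles to the arm's
    Cartesian end-effector position; the exact kinematic model and any
    reward scaling are illustrative assumptions.
    """
    present_xyz = forward_kinematics(joint_angles)
    return -np.linalg.norm(present_xyz - goal_xyz)

# Example with a trivial stand-in kinematic model for an 8-DoF arm.
fk = lambda q: np.array([np.sum(np.cos(q)), np.sum(np.sin(q)), 0.5])
print(reward(np.zeros(8), np.array([1.0, 0.5, 0.5]), fk))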
As our baseline for comparison, we considered a Gaussian-distribution-based DQN method in which an incremental control-policy learning scheme, termed self-modeling, is effectively a Gaussian-process model of the MDP learned offline. The main problems with this approach are that self-modeled Gaussian processes are practically infeasible in continuous spaces and computationally expensive to update in episodic training. In contrast, our proposed BPM algorithm provides the agent with a behavior ensemble of its model that carries abstract information about state-action values. The agent does not update the ensemble; instead, it derives a bias hyper-parameter from the critic network by comparing its observations and beliefs at each episode.
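One plausible reading of this critic-derived bias is sketched below: the mismatch between the critic's predicted value (the agent's belief) and the observed episodic return sets how strongly policy search leans on prior experience; the exponential mapping and its scale are our assumptions, not the paper's exact rule.

import numpy as np

def bias_from_critic(observed_return, critic_value, scale=1.0):
    """Map the belief-observation mismatch to a bias weight in (0, 1].

    A small mismatch (the world still matches the agent's beliefs) keeps the
    bias near 1, so policy search stays close to prior experience; a large
    mismatch drives the bias toward 0, permitting wider exploration.
    The exponential form and `scale` are illustrative assumptions.
    """
    mismatch = abs(observed_return - critic_value)
    return float(np.exp(-scale * mismatch))

# Example: small mismatch between belief and observation -> strong bias.
print(bias_from_critic(observed_return=-3.2, critic_value=-3.0))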
We introduced three different malfunction modes and examined how well our proposed method completes its trajectory under these new working conditions.