A Tale of Two Explanations:

Enhancing Human Trust by Explaining Robot Behavior

Abstract

The ability to provide comprehensive explanations of chosen actions is a hallmark of intelligence. The lack of this ability impedes the general acceptance of AI and robot systems in critical tasks. This paper proposes an integrated framework enabling a robot system to learn a complex manipulation task from human demonstrations and provide effective explanations of its behavior. An embodied haptic prediction model is trained to extract knowledge from sensory feedback, and a stochastic grammar model is induced to capture the compositional nature of a multi-step task. The two modeling components are integrated for joint inference to perform the manipulation task and to provide explanations from both functional and mechanistic perspectives. The robot system not only opens the bottles used in the human demonstrations but also succeeds in opening new, unseen bottles. To examine whether the explanations generated by the robot system can foster human trust in the machine, we conducted a psychological experiment in which human participants were shown different forms of explanations generated by the robot. We found that comprehensive, real-time visualizations of the robot's internal decisions were more effective in promoting human trust than explanations based on summary text descriptions. In addition, the forms of explanation best suited to fostering trust do not necessarily correspond to the model components that contribute most to task performance. This divergence shows a need for the robotics community to integrate model components that enhance both task execution and human trust in machines.

Figures

Figure 1: Overview of demonstration, learning, evaluation, and explainability. By observing human demonstrations, the robot learns, performs, and explains using both a symbolic representation and a haptic representation. (A) Fine-grained human manipulation data are collected using a tactile glove. Based on the human demonstrations, the model learns (B) symbolic representations, by inducing a grammar model that encodes the long-term task structure to generate mechanistic explanations, and (C) embodied haptic representations, using an autoencoder to bridge the human and robot sensory input in a common space and provide a functional explanation of robot actions. These two components are integrated using (D) the generalized Earley parser (GEP) for action planning. These processes complement each other in both (E) improving robot performance and (F) generating effective explanations that foster human trust.
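To make the integration in (D) concrete, below is a minimal, hedged sketch of the underlying idea: letting the grammar constrain which next actions the haptic model may choose. The actual system uses the generalized Earley parser (GEP); the function choose_next_action and the probabilities here are hypothetical simplifications for illustration only, not the GEP itself.

```python
# Simplified illustration of combining the two models for planning.
# The paper uses the generalized Earley parser (GEP); this greedy
# re-ranking is NOT the GEP, only a sketch of the underlying idea.

def choose_next_action(haptic_probs, grammar_prefix_probs):
    """Pick the next action by weighting the haptic model's prediction
    with the probability that the resulting action prefix is still
    parsable under the induced grammar (both dicts: action -> prob)."""
    actions = set(haptic_probs) | set(grammar_prefix_probs)
    scores = {a: haptic_probs.get(a, 0.0) * grammar_prefix_probs.get(a, 0.0)
              for a in actions}
    return max(scores, key=scores.get)

# Made-up example: the haptic model slightly prefers "pull", but the
# grammar assigns it zero probability at this point in the task.
print(choose_next_action({"pull": 0.45, "twist": 0.35, "push": 0.20},
                         {"pull": 0.0, "twist": 0.7, "push": 0.3}))
# -> "twist"
```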

Figure 2: Illustration of the embodied haptic representation and action prediction model. An example of the force information in (A) the human state, collected by the tactile glove, and (C) the robot state, recorded from the force sensors in the robot's end-effector. The background colors indicate different action segments. (B) Embodied haptic representation and action prediction model. The autoencoder (yellow background) takes a human state, reduces its dimensionality to produce a human embedding, and uses the reconstruction to verify that the embedding retains the essential information of the human state. The embodiment mapping network (purple background) takes a robot state and maps it to an equivalent human embedding. The action prediction network (light blue background) takes the human embedding and the current action and predicts which action to take next. In effect, the robot imagines itself as a human based on its own haptic signals and predicts its next action.
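As a concrete reading of panel (B), here is a minimal PyTorch-style sketch of the three networks. All layer sizes, input dimensions, and class names (HumanAutoencoder, EmbodimentMapping, ActionPredictor) are illustrative assumptions, not the architecture or hyperparameters used in the paper.

```python
import torch
import torch.nn as nn

class HumanAutoencoder(nn.Module):
    """Compresses a human haptic state into a low-dimensional embedding;
    the reconstruction checks that the embedding retains the essential
    information of the human state."""
    def __init__(self, human_dim=26, embed_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(human_dim, 64), nn.ReLU(),
                                     nn.Linear(64, embed_dim))
        self.decoder = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(),
                                     nn.Linear(64, human_dim))

    def forward(self, human_state):
        embedding = self.encoder(human_state)
        return embedding, self.decoder(embedding)

class EmbodimentMapping(nn.Module):
    """Maps a robot state (end-effector force readings) into the human
    embedding space, so the robot can 'imagine itself as a human'."""
    def __init__(self, robot_dim=10, embed_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(robot_dim, 64), nn.ReLU(),
                                 nn.Linear(64, embed_dim))

    def forward(self, robot_state):
        return self.net(robot_state)

class ActionPredictor(nn.Module):
    """Predicts a distribution over next actions from the human embedding
    and a one-hot encoding of the current action."""
    def __init__(self, embed_dim=8, num_actions=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim + num_actions, 64), nn.ReLU(),
                                 nn.Linear(64, num_actions))

    def forward(self, embedding, current_action_onehot):
        return self.net(torch.cat([embedding, current_action_onehot], dim=-1))

# Illustrative usage with random tensors (dimensions are assumptions)
embodiment, predictor = EmbodimentMapping(), ActionPredictor()
robot_state = torch.randn(1, 10)
current_action = torch.zeros(1, 7)
current_action[0, 0] = 1.0
embedding = embodiment(robot_state)                 # robot -> imagined human embedding
next_action_logits = predictor(embedding, current_action)
```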

Figure 3: An example of the action grammar induced from human demonstrations. Green nodes represent And-nodes, and blue nodes represent Or-nodes. Probabilities along the edges emanating from Or-nodes indicate the parsing probabilities of taking each branch. Grammar models induced from (A) 5 demonstrations, (B) 36 demonstrations, and (C) 64 demonstrations. The grammar model in (C) also shows a parse graph highlighted in red, where the red numbers indicate the temporal ordering of actions.
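For readers unfamiliar with And-Or grammars, the sketch below shows one possible way to represent such a grammar as a data structure and to sample action sequences (parses) from it. The node layout, action names, and branch probabilities are hand-coded here purely for illustration; in the paper the grammar is induced automatically from demonstrations.

```python
import random
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class AndNode:
    """All children occur, in temporal order."""
    children: List["Node"] = field(default_factory=list)

@dataclass
class OrNode:
    """Exactly one branch is taken, with the given parsing probability."""
    children: List["Node"] = field(default_factory=list)
    probs: List[float] = field(default_factory=list)

# Terminals are action labels (strings)
Node = Union[str, AndNode, OrNode]

def sample_sequence(node, rng):
    """Sample one action sequence (a parse) from the grammar."""
    if isinstance(node, str):                       # terminal action
        return [node]
    if isinstance(node, AndNode):                   # concatenate all children
        return [a for child in node.children for a in sample_sequence(child, rng)]
    branch = rng.choices(range(len(node.children)), weights=node.probs, k=1)[0]
    return sample_sequence(node.children[branch], rng)

# Hypothetical bottle-opening grammar: approach, then either twist directly
# or push down before twisting, then pull the lid off.
grammar = AndNode(["approach",
                   OrNode(["twist", AndNode(["push", "twist"])], [0.4, 0.6]),
                   "pull"])
print(sample_sequence(grammar, random.Random(0)))
```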

Figure 4: Robot task performance on different bottles with various locking mechanisms, using the symbolic planner, the haptic model, and the GEP that integrates both. (A) Testing performance on bottles observed in human demonstrations. Bottle 1 does not have a locking mechanism, Bottle 2 employs a push-twist locking mechanism, and Bottle 3 employs a pinch-twist locking mechanism. (B) Generalization performance on new, unseen bottles. Bottle 4 does not have a locking mechanism, and Bottle 5 employs a push-twist locking mechanism. The bottles used for generalization have similar locking mechanisms but evoke significantly different haptic feedback; see Section S1. Whether tested on demonstration bottles or unseen bottles, the best performance is achieved by the GEP that combines the symbolic planner and the haptic model.

Figure 5: Explanations generated by the symbolic planner and the haptic model. (A) Symbolic (mechanistic) and haptic (functional) explanations at the start of the robot action sequence (a_0). (B), (C), and (D) show the explanations at actions a_2, a_8, and a_9, where a_i denotes the i-th action. Note that red on the robot gripper's palm indicates a large magnitude of force applied by the gripper and green indicates no force; intermediate values are interpolated. These explanations are provided in real time as the robot executes the task.

Figure 6: Illustration of the visual stimuli used in the human experiment. All five groups observed the RGB video recorded from robot executions, but the groups differed in which explanation panels they could access. (A) RGB video recorded from robot executions. (B) Symbolic explanation panel. (C) Haptic explanation panel. (D) Text explanation panel. (E) A summary of which explanation panels were presented to each group.

Figure 7: Human results for trust ratings and prediction accuracy. (A) Qualitative measure of trust: average trust ratings for the five groups. (B) Average prediction accuracy for the five groups. The error bars indicate the 95% confidence intervals. Across both measures, the GEP group performs the best. For qualitative trust, the text group performs most similarly to the baseline group.