tSLAM
Overview
The intricate behaviors an organism can exhibit are predicated on its ability to sense and effectively interpret the complexities of its surroundings. Relevant information is often distributed across multiple modalities, requiring the organism to exhibit information assimilation capabilities in addition to information seeking behaviors. While biological beings leverage multiple sensing modalities for decision making, current robots are overly reliant on visual inputs. In this work, we augment our robots with the ability to leverage the (relatively under-explored) modality of touch. To focus our investigation, we study the problem of scene reconstruction where touch is the only available sensing modality. We present Tactile SLAM (tSLAM), which prepares an agent to acquire information seeking behaviors and use an implicit understanding of common household items to reconstruct the geometric details of the object under exploration. Using the anthropomorphic 'ADROIT' hand, we demonstrate that tSLAM is highly effective in reconstructing objects of varying complexities within 6 seconds of interaction. We also establish the generality of tSLAM by training only on 3D Warehouse objects and testing on ContactDB objects.
Key Insight
Information Seeking
Information Assimilation
Effective understanding of the tactile modality under such settings requires (a) information seeking behaviors, as well as (b) information assimilation capabilities that can consolidate information spread temporally as well as spatially. In this work, we ground these challenges in the context of simultaneous localization and mapping (SLAM) using touch as the only sensing modality.
By Tactile SLAM, we refer to the problem of exploring and reconstructing the geometric details of an object solely using the modality of touch.
tSLAM
tSLAM architecture: at time step t, the agent takes an action a given an occupancy grid observation G and the robot's joint angle sensors as inputs, and transitions to a new state.
The discovery of new occupancy grid cells is used as an intrinsic reward to train the policy. After H steps, the union of all contact points is fed to an implicit function network for detailed reconstruction.
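As an illustration, this intrinsic reward can be computed as the number of previously unobserved voxels revealed by the current step's contact points. Below is a minimal sketch assuming a boolean voxel grid updated in place; the function name and grid bookkeeping are our assumptions, not the authors' implementation.

```python
# Hedged sketch of the discovery-based intrinsic reward: count the voxels
# of the occupancy grid that this step's contact points reveal for the
# first time. Names and grid bookkeeping are illustrative assumptions.
import numpy as np

def discovery_reward(grid, contact_points, origin, voxel_size):
    """Mark voxels containing new contacts and return how many were
    previously unobserved (the intrinsic reward for this step)."""
    idx = np.floor((contact_points - origin) / voxel_size).astype(int)
    idx = np.clip(idx, 0, np.array(grid.shape) - 1)  # keep inside the grid
    idx = np.unique(idx, axis=0)                     # one vote per voxel
    new = ~grid[idx[:, 0], idx[:, 1], idx[:, 2]]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True     # update grid in place
    return float(new.sum())
```

Summed over an episode, a reward of this form encourages the policy to keep touching unexplored regions rather than revisiting surfaces it has already mapped.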
Experimental Setup
We evaluate the performance of tSLAM using the Adroit Manipulation Platform with 60 objects from the ContactDB dataset.
To evaluate the generalizability of our proposed method, both the tactile exploration policy and the implicit feature network are trained using 600 objects from 50 categories of 3D Warehouse.
We focus our analysis on three well-established metrics proposed by Mescheder et al.; a sketch of how each can be computed follows the list below.
Volumetric intersection over union (IoU): Volumetric IoU is the volume of the two meshes' intersection divided by the volume of their union. We obtain unbiased estimates of both volumes by randomly sampling 100k points from the bounding volume and determining whether each point lies inside the ground truth / predicted mesh.
Chamfer-L_2: The Chamfer-L_2 distance is the mean of an accuracy and a completeness term, measured as the mean distance from points on the output mesh to their nearest neighbors on the ground truth mesh, and vice versa. We estimate both terms efficiently by randomly sampling 100k points from each mesh and using a KD-tree to find nearest-neighbor distances.
Normal consistency: The normal consistency score is defined as the mean absolute dot product of the normals in one mesh and the normals at the corresponding nearest neighbors in the GT mesh. It measures how well a method captures higher-order information.
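The sketch below illustrates all three metrics under the sampling scheme described above; the function names, the use of SciPy's cKDTree, and the exact symmetrization are our assumptions, not the authors' code.

```python
# Minimal sketch of the three reconstruction metrics; names, SciPy usage,
# and symmetrization details are illustrative assumptions.
import numpy as np
from scipy.spatial import cKDTree

def volumetric_iou(inside_gt, inside_pred):
    """IoU from in/out tests of the same 100k points sampled in the
    bounding volume: |intersection| / |union|."""
    inter = np.logical_and(inside_gt, inside_pred).sum()
    union = np.logical_or(inside_gt, inside_pred).sum()
    return inter / union

def chamfer_l2(pts_gt, pts_pred):
    """Mean of accuracy (pred -> GT) and completeness (GT -> pred)
    nearest-neighbor distances; conventions differ on squaring."""
    acc, _ = cKDTree(pts_gt).query(pts_pred)    # pred -> GT
    comp, _ = cKDTree(pts_pred).query(pts_gt)   # GT -> pred
    return 0.5 * (acc.mean() + comp.mean())

def normal_consistency(pts_gt, nrm_gt, pts_pred, nrm_pred):
    """Mean absolute dot product between normals and the normals at the
    corresponding nearest neighbors on the other mesh, averaged both ways."""
    _, nn = cKDTree(pts_gt).query(pts_pred)
    pred_to_gt = np.abs((nrm_pred * nrm_gt[nn]).sum(axis=1)).mean()
    _, nn = cKDTree(pts_pred).query(pts_gt)
    gt_to_pred = np.abs((nrm_gt * nrm_pred[nn]).sum(axis=1)).mean()
    return 0.5 * (pred_to_gt + gt_to_pred)
```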
Results
Our method achieves high-fidelity reconstruction of objects with different topologies.
Figure: reconstructions across ContactDB categories, including body, binoculars, airplane, stamp, headphones, toothbrush, hammer, teapot, scissors, knife, eyeglasses, elephant, donut, wristwatch, game controller, hand, banana, mug, stapler, train, wineglass, rabbit, bottle, and camera.
Instead of converging to a mean behavior, our method continually explores the unknown parts of the object conditioned on its current knowledge.
Baseline Comparison
We compare our method with two baselines.
Random Policy: A policy that moves randomly in the action space for tactile exploration.
Heuristic Policy: A heuristically designed policy that induces a power grasp from a randomly initialized open-hand position.
Our method outperforms both baselines by a large margin. During the exploration stage, our method improves over the Random Policy by 13.46% IoU and over the Heuristic Policy by 6.00% (Occupancy Grid). With a better perception model, our method further improves over the Random Policy by 17.50% and over the Heuristic Policy by 11.52% (Reconstruction).
Ablations
To better understand our design decisions, we perform extensive ablation studies investigating the different components of our method:
Ours(-Coverage): Our method trained without the environment coverage reward, with all other parts unchanged. This ablation shows the value of the environment coverage reward.
Ours(-Discovery): Our method trained without the novel parts discovery reward, with all other parts unchanged. This ablation shows the value of the novel parts discovery reward.
Ours(# Points): This ablation replaces the discovery reward with a new-contact-points reward.
Ours(# Contact): This ablation replaces the discovery reward with a binary score at each timestep (i.e., 1 if the hand interacts with the object, 0 if not).
Ours(Knn): We replace the discovery reward with a Knn reward. Intuitively, we want the agent to explore the most unfamiliar parts of the object by maximizing the distance of new contact points from known parts. Thus, we use the mean distance to the 5 nearest neighbors as the reward (a sketch of this reward follows the list).
Ours(Disagreement): We replace the discovery reward with a disagreement reward, where we incentivize the agent to maximize the disagreement between predictions of the object after each touch. We use Alpha Shapes to predict the geometry of the object from the partially observed contact points, and the disagreement is measured by Chamfer distance.
Ours(Chamfer): Our method can also be trained with a supervised exploration reward, where we incentivize the agent to produce a better prediction of the object after each timestep. Thus, we replace the discovery reward with the inverse Chamfer distance between the accumulated point cloud and the ground truth.
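As an illustration of the Ours(Knn) variant, the reward can be computed as the mean distance from each new contact point to its 5 nearest neighbors among previously collected contacts. The sketch below is an assumption about the exact form (including the fallback for short contact histories), not the authors' implementation.

```python
# Hedged sketch of the Ours(Knn) ablation reward; names and the zero-reward
# fallback for short histories are our assumptions.
import numpy as np
from scipy.spatial import cKDTree

def knn_reward(known_points, new_points, k=5):
    """Reward unfamiliarity: new contacts far from all known contacts
    score high, pushing exploration toward unvisited parts."""
    if len(known_points) < k:
        return 0.0  # not enough history to define a k-NN distance yet
    dists, _ = cKDTree(known_points).query(new_points, k=k)
    return float(dists.mean())
```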
Comparing the first two bars, we note that the discovery (curiosity) reward matters more to performance than the coverage reward, which validates that our method conditions on the partially reconstructed geometry of the object rather than aggressively exploring the entire state space. Ours(# Contact) performs the worst, as it is a sparse and extremely noisy reward measure. By increasing the granularity of the reward, Ours(# Points) performs slightly better than Ours(# Contact) but is still not comparable with the other ablations, since it ignores object geometry. Ours(Knn) and Ours(Disagreement) achieve performance similar to ours, which shows that both the Knn and Disagreement rewards properly encode geometry information. This also demonstrates the robustness of our method: it is not sensitive to the exact value of the reward as long as the reward encodes rich information about the object.
Accuracy-Time Plot
A shorter interaction time leads to poorer performance; however, performance begins to converge after the agent has interacted with the object for 6 seconds.