Anonymous author(s)
Despite progress in both traditional dexterous grasping pipelines and recent Vision-Language-Action (VLA) approaches, the grasp execution stage remains prone to pose inaccuracies, especially in long-horizon tasks, which undermines overall performance. To address this “last-mile” challenge, we propose TacRefineNet, a tactile-only framework that achieves fine in-hand pose refinement of known objects to arbitrary target poses using multi-finger fingertip sensing. Our method iteratively adjusts the end-effector pose based on tactile feedback, aligning the object to the desired configuration. We design a multi-branch policy network that fuses tactile inputs from multiple fingers with proprioception to predict precise control updates. To train this policy, we combine large-scale simulated data from a physics-based tactile model in MuJoCo with real-world data collected from a physical system. Comparative experiments show that pretraining on simulated data and then fine-tuning with a small amount of real data significantly improves performance over simulation-only training. Extensive real-world experiments show that the method achieves millimeter-level grasp accuracy using only tactile input. To our knowledge, this is the first method to enable arbitrary in-hand pose refinement via multi-finger tactile sensing alone.
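The iterative adjustment described above can be illustrated with a minimal closed-loop sketch. It is not the TacRefineNet implementation: the learned policy is replaced by a hypothetical proportional controller (`make_toy_policy`) that reads the pose error directly, whereas the actual policy would see only tactile images and proprioception. The function names, the 4-D pose layout, the gain, and the stopping tolerance are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical 4-D in-hand pose: [pitch, roll, y, z]
Pose = List[float]
Policy = Callable[[Sequence[float]], Sequence[float]]


def refine_pose(
    pose: Sequence[float],
    policy: Policy,
    tolerance: float = 1e-3,
    max_steps: int = 50,
) -> Tuple[Pose, int]:
    """Apply policy-predicted updates until the largest update is below `tolerance`.

    Returns the final pose and the number of updates that were applied.
    """
    current = list(pose)
    for step in range(max_steps):
        delta = policy(current)  # predicted control update
        if max(abs(d) for d in delta) < tolerance:
            return current, step
        current = [c + d for c, d in zip(current, delta)]
    return current, max_steps


def make_toy_policy(target: Sequence[float], gain: float = 0.5) -> Policy:
    """Stand-in for the learned network (assumption, for illustration only).

    It takes a proportional step toward a known target pose. The real policy
    would infer the update from tactile and proprioceptive inputs instead.
    """
    def policy(current: Sequence[float]) -> List[float]:
        return [gain * (t - c) for t, c in zip(target, current)]

    return policy


if __name__ == "__main__":
    target = [0.0, 0.1, -0.002, 0.003]
    final, steps = refine_pose([0.2, -0.1, 0.004, -0.003], make_toy_policy(target))
    print(steps, final)
```

The loop stops on the size of the predicted update rather than on a measured pose error, which mirrors a setting where the true object pose is not directly observable and only the policy output is available.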
We conduct experiments under diverse initial and target in-hand poses. The poses are parameterized along four dimensions: pitch, roll, y, and z. For each trial, we randomly select one dimension and set the initial pose by offsetting the object along it, while the target pose is chosen to be its symmetric counterpart along the same dimension. This results in 16 distinct pose pairs.
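The symmetric pairing can be made concrete with a short sketch. The text does not say how the 16 pairs break down, so the enumeration below is only one reading that matches the count: four dimensions, two offset magnitudes per dimension, and two signs. The magnitudes in `MAGNITUDES` are hypothetical placeholders, not values from the experiments.

```python
from itertools import product
from typing import Dict, List, Tuple

DIMENSIONS = ("pitch", "roll", "y", "z")

# Hypothetical offset magnitudes (assumed units: radians for angles, meters for y and z).
MAGNITUDES: Dict[str, Tuple[float, float]] = {
    "pitch": (0.1, 0.2),
    "roll": (0.1, 0.2),
    "y": (0.002, 0.004),
    "z": (0.002, 0.004),
}

PosePair = Tuple[Dict[str, float], Dict[str, float]]


def pose_pairs() -> List[PosePair]:
    """Enumerate (initial, target) pairs that are mirror images along one dimension."""
    pairs: List[PosePair] = []
    for dim in DIMENSIONS:
        for magnitude, sign in product(MAGNITUDES[dim], (1, -1)):
            initial = {d: 0.0 for d in DIMENSIONS}
            target = {d: 0.0 for d in DIMENSIONS}
            initial[dim] = sign * magnitude   # offset along the selected dimension
            target[dim] = -sign * magnitude   # symmetric counterpart
            pairs.append((initial, target))
    return pairs


if __name__ == "__main__":
    for initial, target in pose_pairs():
        print(initial, "->", target)
```

Under this reading, a trial would draw one of the enumerated pairs at random; all other dimensions stay at their nominal value of zero.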
To evaluate the robustness of our method in dynamic scenarios, we conduct a long-horizon object tracking experiment. A fixed tactile image is provided as the target, while the object’s pose and position are continuously perturbed throughout the sequence. The goal is to assess whether our system can consistently readjust to maintain the desired grasp. The results demonstrate that our method reliably performs fine-grained grasp adjustment toward a specified target pose, even under continuous variations in object pose.
To assess the generalization of our method, we evaluate the policy on an unseen object. As shown in the figure, the selected object is geometrically similar to those seen in training but differs in shape and thickness. The dexterous hand is initialized in a random in-hand pose and tasked with reaching a user-specified target pose using only tactile feedback. The figure shows representative examples of successful in-hand pose adjustment on the unseen object. The results indicate that the learned policy generalizes well to novel flat objects, particularly in roll adjustments.