Visual MPC Framework

Our visual MPC framework first extracts keypoints from the input image using a keypoint detector. These keypoints, together with an initial action vector, are fed to a keypoint dynamics model implemented as a differentiable MLP. The dynamics model predicts a keypoint trajectory, which is compared to the goal keypoints using a learned cost function. Minimizing this cost with respect to the actions yields an optimal action vector.
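
To make the data flow concrete, the following sketch traces one forward evaluation of this pipeline in PyTorch; detector, dynamics, and cost_fn are hypothetical callables standing in for the trained components, and the tensor shapes are illustrative assumptions rather than our exact interfaces.

```python
def evaluate_plan(image, actions, goal_keypoints, detector, dynamics, cost_fn):
    """One forward pass: image -> keypoints -> predicted trajectory -> cost."""
    # Extract the current 2D keypoints from the input image.
    keypoints = detector(image)                # e.g. (K, 2)
    # Roll the differentiable MLP dynamics forward under the candidate actions.
    trajectory = dynamics(keypoints, actions)  # e.g. (T, K, 2)
    # The learned cost compares the predicted trajectory to the goal keypoints.
    return cost_fn(trajectory, goal_keypoints)
```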

Keypoint Detector

The keypoint detector is trained as part of an autoencoder with a structural bottleneck that extracts "significant" 2D locations from the input images; the detector itself is the encoder component of this autoencoder. Following [1], we implement the encoder as a miniature version of ResNet-18. The input images are cropped to a resolution of 240 × 240 pixels.
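
One common way to realize such a structural bottleneck is a spatial softmax that reduces each per-keypoint feature map to an expected 2D location; the sketch below is an illustrative assumption of this mechanism, not necessarily our exact implementation.

```python
import torch
import torch.nn.functional as F

def spatial_softmax_keypoints(feature_maps):
    """Structural bottleneck: reduce each feature map to an expected 2D location.

    feature_maps: (B, K, H, W) encoder output, one map per keypoint.
    Returns: (B, K, 2) keypoint coordinates in [-1, 1].
    """
    b, k, h, w = feature_maps.shape
    # Softmax over all spatial positions turns each map into a distribution.
    probs = F.softmax(feature_maps.view(b, k, h * w), dim=-1).view(b, k, h, w)

    # Normalized pixel-coordinate grids in [-1, 1].
    ys = torch.linspace(-1.0, 1.0, h)
    xs = torch.linspace(-1.0, 1.0, w)

    # Expected x and y under each keypoint's spatial distribution.
    expected_x = (probs.sum(dim=2) * xs).sum(dim=-1)  # (B, K)
    expected_y = (probs.sum(dim=3) * ys).sum(dim=-1)  # (B, K)
    return torch.stack([expected_x, expected_y], dim=-1)
```

Because the expectation is differentiable, the keypoint locations can be trained end to end through the autoencoder's reconstruction loss.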

The detector is trained on videos of the robot manipulating the gripped object, varying only the end-effector rotation over the course of each episode-long data point. Each data point starts from a different robot configuration.

Keypoint Dynamics

Action Optimization

Once a trajectory of keypoints and joint configurations has been predicted from the initial state under the initial actions, it is compared to the goal keypoints using the learned cost function. Minimizing this cost function with respect to the actions via gradient descent yields the set of actions that moves the object to its goal position.
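
A minimal sketch of this optimization step in PyTorch, assuming the hypothetical dynamics and cost_fn interfaces used above and an illustrative horizon and action dimension (the full state also carries joint configurations, omitted here for brevity):

```python
import torch

def optimize_actions(keypoints, goal_keypoints, dynamics, cost_fn,
                     horizon=10, action_dim=7, n_steps=200, lr=1e-2):
    """Gradient-descent action optimization (illustrative sketch)."""
    # Initial action sequence; requires_grad so gradients reach the actions.
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        # Roll out the differentiable dynamics from the initial keypoints.
        trajectory = dynamics(keypoints, actions)
        # The learned cost measures distance of the rollout to the goal.
        loss = cost_fn(trajectory, goal_keypoints)
        # Gradients flow through the cost and dynamics back to the actions.
        loss.backward()
        opt.step()
    return actions.detach()
```

Because both the dynamics model and the cost are differentiable, this is plain first-order optimization over the action sequence rather than a sampling-based planner.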

REFERENCES

[1] M. Lambeta, P.-W. Chou, S. Tian, B. Yang, B. Maloon, V. R. Most, D. Stroud, R. Santos, A. Byagowi, G. Kammerer, D. Jayaraman, and R. Calandra. DIGIT: A novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation. IEEE Robotics and Automation Letters, 2020.