Learning meaningful visual representations in an embedding space can facilitate generalization in downstream tasks such as action segmentation and imitation. In this paper, we learn a motion-centric representation of surgical video demonstrations by grouping them into action segments/sub-goals/options in a semi-supervised manner. We present Motion2Vec, an algorithm that learns a deep embedding feature space from video observations by minimizing a metric learning loss in a Siamese network: images from the same action segment are pulled together and pushed away from randomly sampled images of other segments, while respecting the temporal ordering of the images. After pre-training the Siamese network, the embeddings are iteratively segmented with a recurrent neural network for a given parametrization of the embedding space. We use only a small set of labeled video segments to semantically align the embedding space, and assign pseudo-labels to the remaining unlabeled data by inference on the learned model parameters. We demonstrate the use of this representation to imitate surgical suturing motions from publicly available videos of the JIGSAWS dataset. Results give 85.5% segmentation accuracy on average, suggesting a performance improvement over several state-of-the-art baselines, while kinematic pose imitation gives 0.94 centimeter error in position per observation on the test set.
-- Motion2Vec learns a motion-centric representation from video observations by segmenting them into actions/sub-goals/options in a semi-supervised manner. Analogous to Word2Vec and Grasp2Vec, Motion2Vec learns a deep embedding feature space by bringing similar action segments together with metric learning in a Siamese network. The temporal structure in the videos is captured by iteratively segmenting the embedded observations with sequence learning. After pre-training the sequence learning model, predicted sequence segments on unlabeled embedded observations are fed back as pseudo-labels for optimizing the deep metric loss. Consistency, interpretability and supervisory burden are key concerns in imitation and representation learning, as it is often difficult to precisely characterize what defines a segment, and labeling can be time-consuming. Motion2Vec leverages a few labeled demonstrations to semantically align the embedding space, in contrast to time-driven unsupervised/self-supervised approaches.
-- We analyze different combinations of supervised and unsupervised approaches to metric learning, using triplet loss, N-pairs, time contrastive network (TCN) and temporal cycle consistency (TCC), and to sequence learning, using recurrent neural network (RNN), conditional random fields (CRF), hidden Markov model (HMM) and hidden semi-Markov model (HSMM). Results suggest that semantically aligning the embedding space with a small set of labels using triplet loss assists both supervised (RNN, CRF) and unsupervised (HMM, HSMM) sequence learning approaches, whereas unsupervised metric learning with TCN/TCC works well only with supervised sequence learning approaches (RNN, CRF) (see Table 1).
-- We evaluate Motion2Vec by imitating surgical suturing motion on the dual-arm da Vinci Research Kit (dVRK) in simulation from publicly available videos of the JIGSAWS dataset. Results give a segmentation accuracy of 85.5% on the leave one user trial out test set, better than reported in the literature. Semi-supervised learning with Motion2Vec using 25% or more labeled demonstrations gives better segmentation accuracy than competing approaches (see Fig. 1). Further, we obtain 0.94 centimeter error in position per observation on the test set during kinematic imitation of the surgical suturing video on the dVRK (see video). Note that we do not model contact dynamics with the needle and the suturing phantom, and only imitate the suturing motions at the kinematic level.
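The metric learning idea in the highlights above (pull frames of the same action segment together, push frames of other segments away) can be illustrated with a standard triplet margin loss. This is only a minimal sketch, not the Motion2Vec implementation: the squared Euclidean distance, the margin value of 1.0, and the helper names `squared_distance`, `triplet_loss` and `sample_triplet` are assumptions made for illustration, and the temporal-ordering constraint mentioned in the abstract is not modeled.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss; the margin value is an illustrative assumption."""
    return max(
        0.0,
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
    )


def sample_triplet(labels, rng):
    """Return (anchor, positive, negative) frame indices.

    The positive shares the anchor's segment label; the negative is drawn
    at random from frames of any other segment.
    """
    anchor = rng.randrange(len(labels))
    same = [i for i, lab in enumerate(labels) if lab == labels[anchor] and i != anchor]
    other = [i for i, lab in enumerate(labels) if lab != labels[anchor]]
    if not same or not other:
        raise ValueError("need at least two frames per segment and two segments")
    return anchor, rng.choice(same), rng.choice(other)


# Toy data: four 2-D embeddings with hypothetical segment labels "A" and "B".
embeddings = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]]
labels = ["A", "A", "B", "B"]
a, p, n = sample_triplet(labels, random.Random(0))
loss = triplet_loss(embeddings[a], embeddings[p], embeddings[n])
```

In this toy case the loss is zero whenever the negative is already farther from the anchor than the positive by at least the margin, so only triplets that violate the margin would contribute a training signal.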
Motion2Vec for Surgical Suturing Segmentation and Imitation
Embedding space visualization with t-SNE/PCA using time/segment labels
Nearest neighbor imitation (right) for a suturing demonstration (left). The timeline bars (top) show segmentation with ground-truth, RNN (proposed), HMM, HSMM, CRF and KNN predictions.
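Nearest neighbor imitation of this kind can be sketched as a lookup in the embedding space: for each query frame, find the closest embedded training frame and reuse whatever is stored with it. The sketch below is illustrative only; the function name `nearest_neighbor_index`, the squared Euclidean distance and the toy data are assumptions, not details taken from the paper.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_neighbor_index(query, train_embeddings):
    """Index of the training embedding closest to the query embedding."""
    if not train_embeddings:
        raise ValueError("train_embeddings must not be empty")
    return min(
        range(len(train_embeddings)),
        key=lambda i: squared_distance(query, train_embeddings[i]),
    )


# Toy data: three embedded training frames, each paired with a 3-D position (cm).
train_embeddings = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
train_positions = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.2), (3.0, 2.0, 1.0)]

query = [0.9, 1.2]
idx = nearest_neighbor_index(query, train_embeddings)
imitated_position = train_positions[idx]  # reuse the pose of the closest frame
```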
Preliminary kinematic imitation results on the da Vinci robot from a surgical suturing video (bottom-right)
-- unsupervised embedding: Incremental Principal Component Analysis (iPCA), TCC, TCN
-- supervised embedding: N-Pairs, Triplet Loss
-- unsupervised sequence learning: HMM, HSMM
-- supervised sequence learning: CRFs, RNNs
iPCA does not give consistent spatio-temporal clusters, resulting in below-par segmentation.
TCC aligns the demonstrations well in time, but does not capture the spatial dependencies.
TCN trades off between the spatial and temporal grouping of the demonstrations, resulting in sparse clusters.
N-pairs is trained in a similar manner to Motion2Vec, yielding comparable performance with CRFs and RNNs, while proving less conducive for HMMs/HSMMs.
Table 1: Comparison of segmentation accuracy on the evaluation set with a labeled training set. Each row corresponds to a different embedding space approach, and each column to a different segmentation method. Motion2Vec (M2V) combines triplet loss with RNN in a semi-supervised manner to give consistently better results across all supervised and unsupervised sequence learning approaches. M2V-T combines supervised triplet loss with unsupervised single-view time contrastive loss for better nearest neighbor accuracy in the embedding space.
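A segmentation accuracy such as the one reported in Table 1 can be read as the percentage of frames whose predicted segment label matches the ground truth. The sketch below assumes this frame-wise definition; the function name `segmentation_accuracy` and the label strings are hypothetical.

```python
def segmentation_accuracy(predicted, ground_truth):
    """Percentage of frames whose predicted label equals the ground-truth label."""
    if len(predicted) != len(ground_truth):
        raise ValueError("label sequences must have the same length")
    if not ground_truth:
        raise ValueError("label sequences must not be empty")
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return 100.0 * correct / len(ground_truth)


# Toy example: one of four frames is assigned to the wrong segment.
predicted = ["reach", "reach", "pull", "pull"]
ground_truth = ["reach", "pull", "pull", "pull"]
accuracy = segmentation_accuracy(predicted, ground_truth)
```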
Fig. 1: Effect of the percentage of labeled demonstrations on the segmentation accuracy: single-view TCN performs better in unsupervised regimes, while Motion2Vec improves performance over competing approaches with 25% or more labeled demonstrations. Results are averaged over 5 iterations.
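The semi-supervised setting in Fig. 1 relies on assigning pseudo-labels to unlabeled frames from a model fitted on the labeled ones. Motion2Vec does this by inference with a sequence model (an RNN); the sketch below deliberately swaps in a much simpler nearest-centroid classifier as a stand-in, purely to show the data flow. All names and toy values are assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def pseudo_label(labeled, unlabeled):
    """Label each unlabeled embedding with the label of its nearest class centroid.

    labeled:   list of (embedding, label) pairs
    unlabeled: list of embeddings
    Stand-in for sequence-model inference; not the paper's method.
    """
    by_label = {}
    for emb, lab in labeled:
        by_label.setdefault(lab, []).append(emb)
    centroids = {lab: centroid(pts) for lab, pts in by_label.items()}
    return [
        min(centroids, key=lambda lab: squared_distance(emb, centroids[lab]))
        for emb in unlabeled
    ]


# Toy data: a few labeled frames (hypothetical labels) and two unlabeled frames.
labeled = [([0.0, 0.0], "A"), ([0.0, 1.0], "A"), ([5.0, 5.0], "B"), ([6.0, 5.0], "B")]
unlabeled = [[0.2, 0.4], [5.5, 4.8]]
pseudo = pseudo_label(labeled, unlabeled)
```

The pseudo-labeled frames could then be added to the pool from which metric learning triplets are drawn, which is the feedback loop the highlights describe.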
Table 2: Pose imitation error in terms of median cosine quaternion loss and root mean squared error (RMSE) in Cartesian 3-dimensional position space, in centimeters, on the test set. Rows indicate pose imitation of: all surgeons from raw videos (images-all); all surgeons on Motion2Vec (M2V-all); per-surgeon pose decoding on M2V (M2V-per). Motion2Vec imitates the kinematic poses more robustly than decoding from raw images.
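The two error measures named in Table 2 can be sketched as follows. The exact form of the cosine quaternion loss is not given here, so the code assumes one common choice, 1 - |q_pred · q_true| on normalized quaternions, which treats q and -q as the same rotation. The RMSE is assumed to be taken over the per-observation Euclidean position error. Function names are hypothetical.

```python
import math
from statistics import median


def quaternion_cosine_loss(q_pred, q_true):
    """1 - |cos| of the angle between two quaternions (assumed definition)."""
    dot = sum(a * b for a, b in zip(q_pred, q_true))
    norm = math.sqrt(sum(a * a for a in q_pred)) * math.sqrt(sum(b * b for b in q_true))
    return 1.0 - abs(dot) / norm


def position_rmse(pred_positions, true_positions):
    """Root mean squared Euclidean error over 3-D positions (same unit as input)."""
    squared_errors = [
        sum((a - b) ** 2 for a, b in zip(p, t))
        for p, t in zip(pred_positions, true_positions)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy orientations as (w, x, y, z) quaternions and toy positions in centimeters.
pred_q = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)]
true_q = [(-1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)]
median_q_loss = median(quaternion_cosine_loss(p, t) for p, t in zip(pred_q, true_q))

pred_pos = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
true_pos = [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)]
rmse_cm = position_rmse(pred_pos, true_pos)
```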
Fig. 2: Pose imitation on da Vinci robot arms from the Motion2Vec embedded video sequences using a feed-forward neural network, for the left arm (top) and right arm (bottom), in comparison to ground-truth and image-decoded poses. Results suggest 0.94 centimeter error in position per observation on the evaluation set. Ground-truth in blue, raw videos in green, predicted in red.
Please send your feedback and suggestions to Ajay Tanwani: ajay.tanwani at berkeley.edu
Acknowledgements: We thank our collaborators from UC Berkeley, Google Brain, Intel and SRI for their helpful discussions.