Motion2Vec: Semi-Supervised Representation Learning from Surgical Videos

[paper] [supplementary material] [code] [data]

*coming soon*


Learning meaningful visual representations in an embedding space has the potential to facilitate generalization in downstream tasks such as action segmentation and imitation. In this paper, we present a motion-centric representation of surgical manipulation skills from video demonstrations by segmenting them into actions/sub-goals/options in a semi-supervised manner. We present Motion2Vec, an algorithm that learns a deep embedding feature space from video observations by minimizing a metric learning loss in a Siamese network: images from the same action segment are pulled together while being pushed away from randomly sampled images of other segments. The videos are iteratively segmented with a recurrent neural network for a given parametrization of the embedding space. We use only a small set of labeled video segments to semantically align the embedding space, and assign pseudo-labels to the remaining unlabeled data by inference on the learned model parameters. We demonstrate the use of this representation to imitate surgical suturing kinematic motions in simulation from publicly available videos of the JIGSAWS dataset. Results give 85.9% segmentation accuracy and 2.6 centimeter error in position per observation on the test set, suggesting improved performance over several state-of-the-art baselines.
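To make the "pulled together / pushed away" idea concrete, the following is a minimal sketch of a triplet loss over frame embeddings, with triplets sampled from per-frame segment labels. It is not the authors' implementation: the function names, the squared Euclidean distance, and the margin value of 1.0 are assumptions chosen for illustration, and a real system would compute the loss on the output of the Siamese network rather than on hand-written vectors.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (margin is an assumed hyperparameter).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (in squared distance).
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


def sample_triplet(labels, rng):
    """Sample (anchor, positive, negative) frame indices.

    The positive shares the anchor's segment label; the negative is drawn
    at random from frames with any other label.
    """
    anchor = rng.randrange(len(labels))
    positives = [i for i, lab in enumerate(labels)
                 if lab == labels[anchor] and i != anchor]
    negatives = [i for i, lab in enumerate(labels) if lab != labels[anchor]]
    if not positives or not negatives:
        raise ValueError("need at least two frames per label and two labels")
    return anchor, rng.choice(positives), rng.choice(negatives)


# Toy data: four 2-D "embeddings" from two hypothetical action segments.
embeddings = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]]
labels = [0, 0, 1, 1]

a, p, n = sample_triplet(labels, random.Random(0))
loss = triplet_loss(embeddings[a], embeddings[p], embeddings[n])
```

In this toy example the two segments are already well separated, so any sampled triplet has zero loss; swapping the positive and negative produces a large loss, which is the signal that would drive the embedding update.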

Results Summary

-- Motion2Vec learns a motion-centric representation from video observations by segmenting them into actions/sub-goals/options in a semi-supervised manner. Analogous to Word2Vec [1] and Grasp2Vec [2], Motion2Vec learns a deep embedding feature space by bringing similar action segments together with metric learning in a Siamese network. The temporal structure in the videos is captured by iteratively segmenting the embedded observations with sequence learning. After the sequence learning model is pre-trained, the segments it predicts on unlabeled embedded observations are fed back as pseudo-labels for optimizing the deep metric loss. Consistency, interpretability and supervisory burden are key concerns in imitation and representation learning, as it is often difficult to precisely characterize what defines a segment, and labeling can be time-consuming. Motion2Vec leverages a few labeled demonstrations to semantically align the embedding space, in contrast to time-driven unsupervised/self-supervised approaches [3][4].

-- We analyze different combinations of supervised and unsupervised approaches: metric learning using triplet loss, n-pairs, time-contrastive networks (TCN) and temporal cycle consistency (TCC); and sequence learning using recurrent neural networks (RNN), conditional random fields (CRF), hidden Markov models (HMM) and hidden semi-Markov models (HSMM). Results suggest that semantically aligning the embedding space with a small set of labels using triplet loss assists both supervised (RNN/CRF) and unsupervised (HMM/HSMM) sequence learning approaches, whereas unsupervised metric learning with TCN/TCC works well with supervised sequence learning approaches only (RNN/CRF) (see Table 1).

-- We evaluate Motion2Vec by imitating surgical suturing motion on the dual arm da Vinci robot kit (dvrk) in simulation from publicly available videos of the JIGSAWS dataset [5]. On the leave-one-user-trial-out test set, Motion2Vec reaches a segmentation accuracy of 85.9%, higher than reported in the literature [6]. Semi-supervised learning with Motion2Vec using 25% or more labeled demonstrations gives better segmentation accuracy than the competing approaches (see Fig. 1). Further, we obtain 2.6 centimeter error in position per observation on the test set during kinematic imitation of the surgical suturing video on the dvrk in simulation. Preliminary experiments also suggest the feasibility of kinematic imitation on the real robot from surgical suturing videos (see video). Note that we do not model contact dynamics with the needle and the suturing phantom, and only imitate the suturing motions on the kinematic level.
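The pseudo-labeling step described in the bullets above can be illustrated with a deliberately simplified sketch. Motion2Vec predicts segments with a sequence model (an RNN); the code below swaps in a nearest-centroid classifier purely to keep the example short and dependency-free, so it ignores temporal structure entirely. The gesture names "reach" and "pull", the helper names, and the toy vectors are all hypothetical.

```python
def class_centroids(embeddings, labels):
    """Mean embedding per label, computed from the small labeled set."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for k, x in enumerate(vec):
            acc[k] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def assign_pseudo_labels(unlabeled, centroids):
    """Label each unlabeled embedding with its nearest centroid's label.

    Stand-in for the sequence model's prediction; not the paper's method.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [min(centroids, key=lambda lab: sq_dist(vec, centroids[lab]))
            for vec in unlabeled]


# Small labeled set (hypothetical gesture names) and unlabeled frames.
labeled = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labeled_names = ["reach", "reach", "pull", "pull"]
unlabeled = [[0.2, 0.4], [4.8, 5.7]]

centroids = class_centroids(labeled, labeled_names)
pseudo = assign_pseudo_labels(unlabeled, centroids)
```

In a full semi-supervised loop, the labeled frames plus the pseudo-labeled frames would then supply the positive and negative pairs for the next round of metric learning, and the two steps would alternate.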

Motion2Vec for Surgical Suturing Segmentation and Imitation

Embedding space visualization with t-SNE/PCA using time/segment labels

Nearest-neighbor imitation (right) of the suturing demonstration (left). The timeline bars (top) show the segmentation given by the ground truth and by the RNN (proposed), HMM, HSMM, CRF and KNN predictions.

Preliminary kinematic imitation results on the da Vinci robot from a surgical suturing video (bottom right)
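One plausible reading of nearest-neighbor imitation is sketched below: each embedded video frame is matched to its closest embedding in a bank of training frames, and the end-effector position recorded with that training frame is replayed. The page does not spell out the exact procedure, so the bank layout, the function names, and the use of Euclidean distance are assumptions; positions are taken to be in centimeters to match the reported error unit.

```python
import math


def nearest_index(query, bank):
    """Index of the bank embedding closest to the query (Euclidean)."""
    return min(range(len(bank)), key=lambda i: math.dist(query, bank[i]))


def imitate(query_embeddings, bank_embeddings, bank_positions):
    """Return the stored position of each query frame's nearest neighbor."""
    return [bank_positions[nearest_index(q, bank_embeddings)]
            for q in query_embeddings]


def mean_position_error(predicted, reference):
    """Average Euclidean position error per observation."""
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(reference)


# Toy bank: embeddings paired with recorded positions (cm, hypothetical).
bank_embeddings = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
bank_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]

# Embedded frames of a test video and their reference positions.
queries = [[0.1, 0.0], [1.9, 0.0]]
reference = [(0.0, 0.0, 0.0), (2.0, 0.0, 1.0)]

predicted = imitate(queries, bank_embeddings, bank_positions)
error_cm = mean_position_error(predicted, reference)
```

Here the first frame is reproduced exactly and the second is off by 1 cm along one axis, giving a mean error of 0.5 cm per observation.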


-- unsupervised embedding: Incremental Principal Component Analysis (iPCA), TCC, TCN; supervised embedding: N-Pairs, Triplet Loss

-- unsupervised sequence learning: HMM, HSMM; supervised sequence learning: CRFs, RNNs

iPCA does not give consistent spatio-temporal clusters, resulting in below-par segmentation.

TCC aligns the demonstrations well in time, but does not capture the spatial dependencies.

TCN trades off spatial against temporal grouping of the demonstrations, resulting in sparse clusters.

N-pairs is trained in a similar manner to Motion2Vec, yielding comparable performance with CRFs and RNNs, while proving less conducive to HMMs/HSMMs.

Table 1: Comparison of segmentation accuracy on the evaluation set when training on the labeled training set. Each row corresponds to a different embedding space approach and each column to a different segmentation method. Motion2Vec (M2V) uses triplet loss with RNN in a semi-supervised manner.

Fig. 1: Effect of the percentage of labeled demonstrations on segmentation accuracy: single-view TCN performs better in unsupervised regimes, while Motion2Vec improves performance over competing approaches with 25% or more labeled demonstrations. Results are averaged over 5 iterations.
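For readers who want to reproduce numbers of the kind shown in Table 1 and Fig. 1, a minimal sketch of a frame-wise segmentation accuracy follows. The page does not define the metric, so treating accuracy as the percentage of frames whose predicted label matches the ground truth is an assumption, as are the function names and the gesture labels "G1" and "G2".

```python
def segmentation_accuracy(predicted, ground_truth):
    """Percentage of frames whose predicted label equals the ground truth."""
    if len(predicted) != len(ground_truth):
        raise ValueError("label sequences must have the same length")
    if not ground_truth:
        raise ValueError("label sequences must be non-empty")
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return 100.0 * correct / len(ground_truth)


def mean_accuracy(runs):
    """Average accuracy over repeated runs, as in 'averaged over 5 iterations'."""
    return sum(segmentation_accuracy(p, g) for p, g in runs) / len(runs)


# Toy per-frame labels for one video (hypothetical gesture names).
truth = ["G1", "G2", "G2", "G2"]
run_a = ["G1", "G1", "G2", "G2"]  # one frame wrong
run_b = ["G1", "G2", "G2", "G2"]  # all frames right

acc_a = segmentation_accuracy(run_a, truth)
acc_mean = mean_accuracy([(run_a, truth), (run_b, truth)])
```

With these toy sequences, the first run scores 75.0 and the two-run average is 87.5.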


[1] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” NIPS, 2013

[2] E. Jang, C. Devin, V. Vanhoucke, and S. Levine, “Grasp2vec: Learning object representations from self-supervised grasping,” arXiv, 2018

[3] P. Sermanet, C. Lynch, J. Hsu, and S. Levine, “Time-contrastive networks: Self-supervised learning from multi-view observation,” arXiv, 2017

[4] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman, “Temporal cycle-consistency learning,” arXiv, 2019

[5] Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Bejar, D. D. Yuh et al., “JHU-ISI gesture and skill assessment working set (JIGSAWS): A surgical activity dataset for human motion modeling,” MICCAI Workshop, 2014

[6] N. Ahmidi, L. Tao, S. Sefati, Y. Gao, C. Lea, B. B. Haro, L. Zappella, S. Khudanpur, R. Vidal, and G. D. Hager, “A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery,” IEEE TBE, 2017

Contact Us

Authors: Ajay Tanwani, Pierre Sermanet, Andy Yan, Raghav Anand, Mariano Phielipp, Ken Goldberg

Please send your feedback and suggestions to: ajay.tanwani at

Acknowledgements: We thank Joseph E. Gonzalez, Minho Hwang, Xin Wang, Daniel Seita, Yixin Gao, the JIGSAWS CIRL team and our collaborators from Google Brain, Intel and SRI for their helpful discussions.