Intermittent Visual Servoing: Efficiently Learning Policies Robust to Instrument Changes for High-precision Surgical Manipulation

Samuel Paradis, Minho Hwang, Brijen Thananjeyan, Jeffrey Ichnowski, Daniel Seita, Danyal Fer, Thomas Low, Joseph E. Gonzalez, Ken Goldberg

Paper: [Link]

Intermittent Visual Servoing

We propose IVS, a framework for automation of high-precision surgical subtasks by learning sample-efficient, accurate, closed-loop policies that operate directly on visual feedback instead of the robot's encoder estimates. IVS combines coarse planning over a robot model with learning-based visual feedback control at segments of the task that require high precision. IVS attains the highest published success rates for automated surgical peg transfer and maintains performance across different instruments. Developing controllers that are efficiently transferable across instruments is critical to automation of surgical subtasks, since instruments are exchanged frequently in the surgical setting and instrument properties change over time with increased usage.


Automation of surgical tasks using cable-driven robots is challenging due to backlash, hysteresis, and cable tension, and these issues are exacerbated because surgical instruments must often be changed during an operation. In this work, we propose a framework for automation of high-precision surgical tasks by learning sample-efficient, accurate, closed-loop policies that operate directly on visual feedback instead of robot encoder estimates. This framework, which we call intermittent visual servoing (IVS), intermittently switches to a learned visual servo policy for high-precision segments of repetitive surgical tasks while relying on a coarse open-loop policy for the segments where precision is not necessary. To compensate for cable-related effects, we apply imitation learning to rapidly train a policy that maps images of the workspace and instrument from a top-down RGB camera to small corrective motions. We train the policy using only 180 human demonstrations that are roughly 2 seconds each. Results on a da Vinci Research Kit suggest that combining the coarse policy with half a second of corrections from the learned policy during each high-precision segment improves the success rate on the Fundamentals of Laparoscopic Surgery peg transfer task from 72.9% to 99.2%, 31.3% to 99.2%, and 47.2% to 100.0% for 3 instruments with differing cable-related effects. In the contexts we studied, IVS attains the highest published success rates for automated surgical peg transfer and is significantly more reliable than previous techniques when instruments are changed.


Switching surgical instruments during surgery is both necessary and common. Depending on the type of procedure, up to four instruments may be exchanged on a single arm in rapid succession to perform a task, and these cycles occur multiple times over a given procedure. These exchanges have been shown to account for 10 to 30% of total operative time, increasing patient exposure to anesthesia. Additionally, each instrument may only be used for 10 operations, regardless of operation length, due to potential instrument degradation, and even within this permitted-use window instruments frequently fail. Moreover, between patients, instruments must undergo high-pressure, high-heat sterilization that further degrades the instrument. Finally, instrument collisions during a procedure are common and can alter the cabling properties of the arm, necessitating re-calibration in the case of automated surgery. Sophisticated, instrument-specific calibration techniques require many long trajectories of data, which further increases the wear on the instrument, reduces its lifespan, and can require time during or before a surgical procedure to collect data.

Developing controllers that are efficiently transferable across instruments is critical to automation of surgical tasks, since instruments are exchanged frequently and instrument properties change over time with increased usage.

FLS Peg Transfer

We focus on the FLS peg transfer task, using red 3D printed blocks and a red 3D printed pegboard. The task involves transferring 6 blocks from the 6 left pegs to the 6 right pegs, and transferring them back from the right pegs to the left pegs. We define the peg transfer task as consisting of a series of smaller subtasks, with the following success criteria:

  • Pick: the surgical robot grasps a block and lifts it off the pegboard.

  • Place: the surgical robot securely places a block around a target peg.

Data Collection

Using teleoperation, we collect 15 corrective trajectories on each of the 12 pegs for both the pick and place subtasks, resulting in 180 expert transfers and 360 corrective expert trajectories. We preprocess the images in two ways: (1) crop a 150x150 image around the center of the target peg, and (2) color-crop out all red pixels that lie farther than a block-sized radius from the center of the peg.
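The two preprocessing steps can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `preprocess`, the red-pixel heuristic, and the default values of `block_radius_px` and `red_margin` are assumptions chosen for the example. Only the 150x150 crop size comes from the text above.

```python
import numpy as np

def preprocess(image, peg_xy, crop=150, block_radius_px=40, red_margin=60):
    """Crop a crop x crop patch centered on the target peg, then blank out
    red pixels farther than block_radius_px from the patch center.

    image:  H x W x 3 uint8 RGB array.
    peg_xy: (x, y) pixel coordinates of the target peg center.
    block_radius_px and red_margin are hypothetical values for illustration.
    """
    half = crop // 2
    cx, cy = peg_xy
    # Zero-pad so that pegs near the image border still yield a full patch.
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="constant")
    patch = padded[cy:cy + crop, cx:cx + crop].copy()

    # Assumed red test: the red channel exceeds both other channels by a margin.
    r, g, b = (patch[..., i].astype(np.int32) for i in range(3))
    is_red = (r - np.maximum(g, b)) > red_margin

    # Pixels outside a block-sized disc around the peg (patch center).
    yy, xx = np.mgrid[0:crop, 0:crop]
    is_far = (xx - half) ** 2 + (yy - half) ** 2 > block_radius_px ** 2

    patch[is_red & is_far] = 0
    return patch
```

With this masking, the only red object left in the patch is the block (or peg) nearest the target, so the policy is not distracted by neighboring blocks.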

Supervision Extraction

The extracted action, indicated by the arrows, is a vector of length λ pointing from the current waypoint towards the next waypoint on the expert trajectory that is at least a distance λ away. The extracted termination label, indicated by the color of the waypoint, is 1 if the distance to the final position in the trajectory is less than ν, and 0 otherwise. Each waypoint corresponds to an image, and each image receives both a corrective action label and a termination signal label.
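The labeling rule above can be sketched in a few lines. This is an illustrative reading of the description, not the authors' implementation: the function name `extract_labels` is hypothetical, and assigning a zero action when no later waypoint is at least λ away is an assumption, since the text does not specify that case.

```python
import numpy as np

def extract_labels(waypoints, lam, nu):
    """Return (actions, terminate) for an expert trajectory.

    waypoints: sequence of positions, one per recorded image.
    lam:       action length (lambda in the text), must be > 0.
    nu:        termination distance threshold.
    """
    pts = np.asarray(waypoints, dtype=float)
    final = pts[-1]
    actions = np.zeros_like(pts)
    terminate = np.zeros(len(pts), dtype=int)

    for i, p in enumerate(pts):
        # Termination label: 1 if within nu of the final position.
        terminate[i] = int(np.linalg.norm(final - p) < nu)
        # Action label: length-lam vector towards the first later waypoint
        # that is at least lam away from the current one.
        for q in pts[i + 1:]:
            dist = np.linalg.norm(q - p)
            if dist >= lam:
                actions[i] = lam * (q - p) / dist
                break
        # Assumption: if no such waypoint exists, the action stays zero.
    return actions, terminate
```

Skipping waypoints closer than λ smooths over small jitters in the teleoperated demonstration, so each image is labeled with the overall direction of travel rather than frame-to-frame noise.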

Preprocessed images from a corrective place trajectory, with labels extracted using the method described above.

Visual Feedback Policy

The policy consists of an ensemble of 4 Convolutional Neural Networks. Each model uses a processed 150x150 image as input, and regresses the corrective direction and classifies the completion status. The weights in the convolutional layers are shared, as useful convolutional filters are likely similar across tasks, while the weights in the dense layers are independent. Sharing convolutional layers provides two sources of supervision when training the filters.

Once trained, we evaluate the ensemble of models in parallel on a filtered RGB image. The predicted action is the mean action across the ensemble, and the termination condition is met if at least 3 of the 4 models predict termination with probability greater than 0.70.
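The aggregation rule can be written compactly. The sketch below assumes each model's outputs have already been computed; the function name `aggregate` and the array layout are illustrative choices, while the threshold of 0.70 and the 3-of-4 vote come from the text.

```python
import numpy as np

def aggregate(actions, term_probs, threshold=0.70, min_votes=3):
    """Combine per-model predictions from the ensemble.

    actions:    array of shape (n_models, action_dim), one direction per model.
    term_probs: array of shape (n_models,), termination probability per model.
    Returns (mean_action, terminate).
    """
    actions = np.asarray(actions, dtype=float)
    term_probs = np.asarray(term_probs, dtype=float)
    mean_action = actions.mean(axis=0)
    # A model votes for termination only if its probability exceeds the threshold.
    votes = int((term_probs > threshold).sum())
    return mean_action, votes >= min_votes

action, done = aggregate(
    actions=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
    term_probs=[0.90, 0.80, 0.71, 0.20],
)
print(action, done)  # [0.5 0.5] True
```

Requiring a supermajority rather than a single confident model makes a premature grasp or release less likely when one network is overconfident.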

We move the robot in the predicted direction until the termination condition is met. The direction and termination condition are updated 10 times per second.
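A servoing loop of this kind might look like the sketch below. All callables (`get_image`, `policy`, `move`, `stop`) are hypothetical placeholders for the camera, the learned policy, and the robot interface, and the `max_steps` safety cap is an assumption that the text does not mention. Only the 10 Hz update rate is taken from the description.

```python
import time

def servo(get_image, policy, move, stop, rate_hz=10.0, max_steps=50,
          sleep=time.sleep):
    """Run visual servoing until the policy signals termination.

    get_image: returns the current preprocessed camera image.
    policy:    maps an image to (direction, terminate).
    move:      commands the robot to move in the given direction.
    stop:      halts the robot.
    Returns the number of corrective updates that were executed.
    """
    period = 1.0 / rate_hz
    for step in range(max_steps):
        direction, terminate = policy(get_image())
        if terminate:
            stop()
            return step
        move(direction)   # keep moving while the next frame is processed
        sleep(period)     # refresh direction and termination at rate_hz
    stop()                # assumed safety cap: give up after max_steps
    return max_steps
```

At 10 updates per second, a correction of around a dozen updates takes roughly 1.2 seconds, which matches the scale of the corrections reported in the examples on this page.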

IVS Example

Example of IVS correcting cable-related errors.

Small blue circles are added to highlight an optimal pick location on the block, and small green circles to highlight the location of the pegs.

Top: IVS corrects a pickup error with 5 corrective updates in 0.6 seconds. The robot is mispositioned due to the inaccuracy of the coarse controller: the end-effector is not positioned over the block (Frame 1). We then switch to the learned controller, and visual servoing guides the end-effector over a pick point (Frames 2-5). Once the policy determines that a safe pickup is possible, the robot picks the block successfully (Frame 6).

Bottom: IVS corrects a placement error with 12 corrective updates in 1.2 seconds. The robot is mispositioned due to the inaccuracy of the coarse controller: the peg is not under the block (Frame 1). We then switch to the learned controller, and visual servoing guides the block over the peg (Frames 2-5). Once the policy determines that it is safe to drop, the robot places the block successfully (Frame 6).


Uncalibrated Baseline (UNCAL)

This is a coarse open-loop policy, implemented using the default dVRK controller with no modifications. The trajectories are tracked in closed loop with respect to the robot's odometry, but open loop with respect to vision. Due to cable-related effects such as backlash, hysteresis, and cable tension, this method often struggles to consistently complete high-precision tasks because its odometry will erroneously signal that it has reached the target position.

Calibrated Baseline (CAL)

This is a calibrated open-loop policy implemented directly from Hwang et al., using the authors' source code. This is the current state-of-the-art method for automating peg transfer. To correct for backlash, hysteresis, and cable tension, the authors train a recurrent dynamics model to estimate the true position of the robot based on prior commands. Similar to the uncalibrated baseline, the robot tracks reference trajectories to target grasp and place locations in closed loop with respect to the position estimated by the recurrent model, but open loop with respect to visual inputs.

Peg Transfer Baseline Comparison

Benchmark comparing performance of IVS to the baselines described above (Section V of the paper). IVS outperforms both baselines in pick success rate, place success rate, and overall transfer success rate. Because IVS uses RGB imaging, it can take many corrective steps per second without stopping, minimizing additions to the mean transfer time.

Instrument Transfer Comparison

Benchmark comparing peg transfer performance of the uncalibrated baseline, calibrated baseline, and IVS across 3 different surgical instruments with unique cabling characteristics. We investigate whether models learned from data using one instrument can transfer to another instrument without modification. This is challenging because different surgical robotic instruments, even of the same type, have different cabling properties due to differences in wear and tear, which result in different success rates. We hypothesize that errors in executing a corrective motion can be mitigated over time by executing additional corrective motions, as long as the cumulative error is decreasing. However, the calibrated baseline uses an observer model that explicitly predicts the motion of the robot based on prior commands. This requires learning the dynamics of the specific instrument used in training, and the learned dynamics may not be sufficiently accurate on a new instrument. We observe that the IVS model trained on instrument A does not decrease in performance on different instruments, while the calibrated baseline's performance drops significantly on different instruments.

Efficiency Benchmark

Benchmark analyzing efficiency of IVS on various instruments. IVS adds about 1.22 seconds per transfer. The goal of IVS is to increase success rates rather than to reduce transfer time, but we find that it adds little time compared to the baselines. Because RGB imaging allows fast frame capture and continuous servoing, we can update the robot's velocity and check for termination 10 times per second, minimizing additions to the mean transfer time. As a result, the mean transfer time is only 1.5 s slower than the uncalibrated baseline and 0.7 s slower than the calibrated baseline.

Across 30 trials with 3 different instruments, comprising 360 total transfers, IVS has 2 total failures. Both failures occur during picks, when the tool collides with the block. Thus, in total...

IVS succeeds on 358/360 picks, and 358/358 places, for a 99.4% transfer success rate.
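The arithmetic behind these totals can be checked directly. The sketch below assumes that a place is attempted only after a successful pick, which is consistent with the 358 place attempts reported; the variable names are illustrative.

```python
trials = 30
transfers_per_trial = 12   # 6 blocks moved left to right, then back again

pick_attempts = trials * transfers_per_trial   # 360
pick_failures = 2
picks_ok = pick_attempts - pick_failures       # 358

place_attempts = picks_ok   # assumption: a place follows every successful pick
places_ok = place_attempts  # no place failures were observed

pick_rate = picks_ok / pick_attempts
place_rate = places_ok / place_attempts
transfer_rate = pick_rate * place_rate

print(f"{100 * transfer_rate:.1f}%")  # 99.4%
```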

Video Summary


Video Archive

Arm A: Uncalibrated Baseline

Arm B: Uncalibrated Baseline

Arm C: Uncalibrated Baseline