Robotic ultrasound scanning: Learning from exploratory demonstrations using probabilistic temporal ranking

Burke, M., Lu, K., Angelov, D. et al. Learning rewards from exploratory demonstrations using probabilistic temporal ranking. Auton Robot 47, 733–751 (2023). https://doi.org/10.1007/s10514-023-10120-w


We propose a pairwise observation ranking reward model for learning to perform ultrasound scanning from sub-optimal demonstration image sequences. Ultrasound scanning is an exploratory process that requires a period of discovery before an optimal image can be captured. This class of demonstration is unsuitable for most existing approaches to inverse reinforcement learning (IRL), in particular maximum entropy IRL.

Obtaining stable ultrasound images requires contact with a deformable body at an appropriate position and contact force; image quality is affected by the thickness of the ultrasound gel between the body and the probe, while air pockets can obscure object detection. This means that human demonstrations are inherently sub-optimal, as the demonstrator must actively search for target objects while trying to find a good viewpoint position and an appropriate contact force. High quality ultrasound images (above left) captured by a human demonstrator show high intensity contour outlines, centre the target object of interest, and generally give some indication of the target object's size.

Probabilistic temporal ranking

We use time as a supervisory signal, sampling image observations from kinesthetic demonstrations and generating pairwise comparison outcomes using a probabilistic ranking model. A maximum likelihood approximation (left) to this model can be trained in an end-to-end fashion.
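As a rough illustration, the sketch below shows one way this maximum likelihood approximation could be implemented in PyTorch: pairs of images are sampled from a demonstration, the later image is treated as noisily preferred, and a small reward network is trained with a logistic (Bradley-Terry style) ranking loss. The RewardNet architecture, sampling scheme, and hyper-parameters are illustrative assumptions rather than the exact model used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardNet(nn.Module):
    """Illustrative CNN mapping a single-channel ultrasound image to a scalar reward."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, x):
        return self.head(self.features(x).flatten(1)).squeeze(-1)

def temporal_ranking_loss(reward_net, obs_a, obs_b):
    """obs_b was observed later than obs_a, so it is treated as (noisily) preferred.
    P(obs_b preferred) = sigmoid(r_b - r_a); we minimise the negative log-likelihood."""
    r_a, r_b = reward_net(obs_a), reward_net(obs_b)
    return F.softplus(r_a - r_b).mean()   # = -log sigmoid(r_b - r_a)

def train_step(reward_net, optimiser, demo, n_pairs=64):
    """One gradient step on pairs sampled from a demonstration `demo` (T x 1 x H x W)."""
    T = demo.shape[0]
    i = torch.randint(0, T - 1, (n_pairs,))                     # earlier frame indices
    j = torch.stack([torch.randint(int(k) + 1, T, (1,)).squeeze(0) for k in i])  # later frames
    loss = temporal_ranking_loss(reward_net, demo[i], demo[j])
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```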

Grid-world experiments

In grid-world experiments, this ranking model substantially outperforms maximum entropy IRL and a linearly increasing reward baseline when demonstrations are sub-optimal and require a search phase before settling on the target, while performing similarly to maximum entropy IRL when optimal demonstrations are available.

Optimal demonstrations

Ground truth reward (reference)
Probabilistic temporal ranking reward: average return 9.51 ± 4.92
GP maximum entropy reward: average return 9.58 ± 4.90
GP linear increasing reward: average return 7.39 ± 5.72

Sub-optimal demonstrations

Ground truth reward (reference)
Probabilistic temporal ranking reward: average return 7.42 ± 4.82
GP maximum entropy reward: average return 3.31 ± 4.24
GP linear increasing reward: average return 2.77 ± 4.30
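For intuition, the toy sketch below shows one way average returns like those above could be computed: value iteration is run under a recovered reward map, and rollouts of the resulting greedy policy are scored with the ground truth reward. The grid dynamics, policy, and evaluation horizon are illustrative assumptions, not the paper's exact experimental protocol.

```python
import numpy as np

def value_iteration(reward, gamma=0.95, n_iters=200):
    """Greedy state values for a deterministic, 4-connected grid world."""
    H, W = reward.shape
    V = np.zeros((H, W))
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(n_iters):
        V_new = np.full((H, W), -np.inf)
        for dy, dx in moves:
            # value of the neighbouring cell reached by this move (-inf off-grid)
            shifted = np.pad(V, 1, constant_values=-np.inf)[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
            V_new = np.maximum(V_new, reward + gamma * shifted)
        V = V_new
    return V

def average_return(learned_reward, true_reward, starts, horizon=30, gamma=0.95):
    """Follow the greedy policy under the learned reward, score it with the true reward."""
    V = value_iteration(learned_reward, gamma)
    H, W = true_reward.shape
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    returns = []
    for y, x in starts:
        total = 0.0
        for t in range(horizon):
            total += (gamma ** t) * true_reward[y, x]
            # step to the in-grid neighbour with the highest value under the learned reward
            candidates = [(y + dy, x + dx) for dy, dx in moves
                          if 0 <= y + dy < H and 0 <= x + dx < W]
            y, x = max(candidates, key=lambda p: V[p])
        returns.append(total)
    return np.mean(returns), np.std(returns)

# Example: evaluate a learned reward against a known ground truth
# true_r = np.zeros((10, 10)); true_r[7, 7] = 1.0   # hypothetical goal cell
# mean_ret, std_ret = average_return(learned_r, true_r, starts=[(0, 0), (5, 2)])
```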

Autonomous ultrasound scanning

When used for autonomous ultrasound scanning with a Bayesian optimisation search strategy, the proposed ranking reward finds final scan images containing the target object (a tumour-like mass) more frequently than maximum entropy IRL. When the maximum entropy approach does succeed, it tends to find off-centre viewpoints and images only small portions of the target object.

Probabilistic temporal ranking

Maximum entropy reward

The Bayesian optimisation strategy successfully builds a reward map over a search volume, identifying the best position (green) from which to capture ultrasound scans. 

Rewards predicted for trapezoidal scans are used to explore the scan volume

A reward map over the scan volume is used to select the best position for ultrasound scanning

Saliency maps show that the reward is associated with the target object

Bayesian optimisation policy

Ultrasound scanning is a highly uncertain and dynamic domain. Obtaining stable ultrasound images requires contact with a deformable imaging phantom at an appropriate position and contact force; image quality is affected by the thickness of the ultrasound gel between the phantom and the probe, while air pockets within the phantom can obscure object detection. Moreover, both air pockets and gel can move in response to manipulator contact.

Using the proposed reward model, the Bayesian optimisation policy finds optimal imaging positions and successfully identifies the target object, unlike maximum entropy IRL which fails to locate the target.
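The sketch below illustrates a Bayesian optimisation loop of this kind, using a scikit-learn Gaussian process and an upper confidence bound acquisition over a discretised set of candidate probe poses. The scan_and_score callable, kernel, and acquisition settings are hypothetical placeholders; the real system searches a 3D scan volume with the learned reward model.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def bayes_opt_scan(scan_and_score, candidates, n_init=5, n_steps=25, kappa=2.0):
    """UCB Bayesian optimisation over candidate probe poses (N x d array).

    scan_and_score(pose) is a hypothetical callable that moves the probe to
    `pose`, captures an image, and returns the learned reward for that image.
    """
    rng = np.random.default_rng(0)
    tried = list(rng.choice(len(candidates), size=n_init, replace=False))
    X = candidates[tried]
    y = np.array([scan_and_score(p) for p in X])

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.05) + WhiteKernel(1e-3),
                                  normalize_y=True)
    for _ in range(n_steps):
        gp.fit(X, y)
        mu, sigma = gp.predict(candidates, return_std=True)
        ucb = mu + kappa * sigma
        ucb[tried] = -np.inf                        # avoid re-querying the same pose
        nxt = int(np.argmax(ucb))
        tried.append(nxt)
        X = np.vstack([X, candidates[nxt]])
        y = np.append(y, scan_and_score(candidates[nxt]))
    best_pose = X[int(np.argmax(y))]                # best scanning position found
    return best_pose, gp
```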

Rewards predicted for training sequences show that the model captures non-monotonically increasing exploration processes

When benchmarked against human image ratings, probabilistic temporal ranking performs particularly well, with high levels of agreement, and performs better than approaches like maximum entropy inverse reinforcement learning and T-REX, which struggle with high-dimensional, exploratory demonstrations and limited training data.
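For reference, agreement with human ratings can be summarised with pairwise agreement or a rank correlation such as Kendall's tau; the sketch below computes both, with placeholder arrays standing in for the actual model scores and human ratings used in the paper, whose exact evaluation protocol may differ.

```python
import itertools
from scipy.stats import kendalltau

def pairwise_agreement(model_scores, human_ratings):
    """Fraction of image pairs ordered the same way by the model and by human raters."""
    agree, total = 0, 0
    for i, j in itertools.combinations(range(len(model_scores)), 2):
        if human_ratings[i] == human_ratings[j]:
            continue                                # skip pairs the humans rated as ties
        total += 1
        agree += (model_scores[i] > model_scores[j]) == (human_ratings[i] > human_ratings[j])
    return agree / max(total, 1)

def agreement_summary(model_scores, human_ratings):
    """Pairwise agreement plus Kendall's tau rank correlation."""
    tau, p_value = kendalltau(model_scores, human_ratings)
    return {"pairwise_agreement": pairwise_agreement(model_scores, human_ratings),
            "kendall_tau": tau, "p_value": p_value}

# Placeholder per-image scores and ratings, purely for illustration:
# summary = agreement_summary([0.1, 0.7, 0.4], [1, 5, 3])
```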