Burke, M., Lu, K., Angelov, D. et al. Learning rewards from exploratory demonstrations using probabilistic temporal ranking. Auton Robot 47, 733–751 (2023). https://doi.org/10.1007/s10514-023-10120-w
We propose a pairwise observation ranking reward model to learn ultrasound scanning from sub-optimal demonstration image sequences. Ultrasound scanning is an exploratory process and requires a period of discovery before an optimal image can be captured. This class of demonstration is unsuitable for most existing approaches to inverse reinforcement learning (IRL), in particular maximum entropy IRL.
Obtaining stable ultrasound images requires contact with a deformable body at an appropriate position and contact force, with image quality affected by the thickness of the ultrasound gel between the body and the probe, while air pockets can obscure object detection. This means that human demonstrations are inherently sub-optimal: the demonstrator must actively search for target objects while attempting to locate a good viewpoint and an appropriate contact force. High-quality ultrasound images (above left) captured by a human demonstrator show high-intensity contour outlines, centre the target object of interest, and generally provide some indication of target object size.
We use time as a supervisory signal, sampling image observations from kinesthetic demonstrations and generating pairwise comparison outcomes using a probabilistic ranking model. A maximum likelihood approximation (left) to this model can be trained in an end-to-end fashion.
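A minimal sketch of this maximum likelihood approximation is given below, assuming a PyTorch setup: pairs of images are sampled from a demonstration, the later image is assumed to be preferred, and a reward network is trained with a Bradley-Terry style pairwise loss. The architecture, image format and names (RewardNet, sample_pair, train) are illustrative, not the paper's exact model.

```python
# Sketch: maximum likelihood pairwise temporal ranking of demonstration images.
# Assumes each demonstration is a list of (1, H, W) image tensors.
import random
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps an image observation to a scalar reward estimate."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, x):
        return self.head(self.features(x))

def sample_pair(demo):
    """Sample two observations from one demonstration; time provides the label."""
    i, j = sorted(random.sample(range(len(demo)), 2))
    return demo[i], demo[j]  # demo[j] occurs later, so it is assumed preferred

def train(demos, steps=1000, batch_size=32, lr=1e-4):
    net = RewardNet()
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        demo = random.choice(demos)
        early, late = zip(*[sample_pair(demo) for _ in range(batch_size)])
        early, late = torch.stack(early), torch.stack(late)
        # Bradley-Terry likelihood: the later image should receive a higher reward
        logits = net(late) - net(early)
        loss = bce(logits, torch.ones_like(logits))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net
```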
In grid world experiments, this ranking model substantially outperforms maximum entropy IRL and a linearly increasing reward baseline when demonstrations are sub-optimal and require a searching phase before optimisation, and performs similarly to maximum entropy IRL when optimal demonstrations are available.
Grid world results (average return ± standard deviation), comparing each learned reward against the ground truth reward:
- Optimal demonstrations: probabilistic temporal ranking 9.51 ± 4.92; GP maximum entropy 9.58 ± 4.90; GP linear increasing 7.39 ± 5.72.
- Sub-optimal (exploratory) demonstrations: probabilistic temporal ranking 7.42 ± 4.82; GP maximum entropy 3.31 ± 4.24; GP linear increasing 2.77 ± 4.30.
When used for autonomous ultrasound scanning with a Bayesian optimisation search strategy, the proposed ranking reward finds final scan images containing a target object (a tumour-like mass) more frequently than maximum entropy IRL. The maximum entropy approach fails more often, and even when detection succeeds it tends to select off-centre viewpoints that image only small portions of the target object.
The Bayesian optimisation strategy successfully builds a reward map over a search volume, identifying the best position (green) from which to capture ultrasound scans.
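A search loop of this kind might look like the sketch below, assuming a Gaussian process surrogate over probe positions scored by the learned ranking reward and an upper-confidence-bound acquisition; scan_at, reward_of, the grid resolution and kernel settings are hypothetical placeholders rather than the paper's implementation.

```python
# Sketch: Bayesian optimisation over a discretised probe search volume,
# using the learned reward to score the image captured at each position.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def bayes_opt_scan(scan_at, reward_of, bounds, n_init=5, n_iter=20, kappa=2.0):
    # Candidate probe positions on a regular grid over the search volume
    grid = np.stack(np.meshgrid(*[np.linspace(lo, hi, 15) for lo, hi in bounds]),
                    axis=-1).reshape(-1, len(bounds))
    X = grid[np.random.choice(len(grid), n_init, replace=False)]
    y = np.array([reward_of(scan_at(x)) for x in X])
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.05), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        mu, std = gp.predict(grid, return_std=True)
        x_next = grid[np.argmax(mu + kappa * std)]   # UCB acquisition
        y_next = reward_of(scan_at(x_next))
        X, y = np.vstack([X, x_next]), np.append(y, y_next)
    return X[np.argmax(y)], gp  # best scanning position and the reward map surrogate
```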
Ultrasound scanning is a highly uncertain and dynamic domain. Obtaining stable ultrasound images requires contact with a deformable imaging phantom at an appropriate position and contact force, with image quality affected by the thickness of the ultrasound gel between the phantom and the probe, while air pockets within the phantom can obscure object detection. Moreover, the air pockets and gel can move in response to manipulator contact.
Using the proposed reward model, the Bayesian optimisation policy finds optimal imaging positions and successfully identifies the target object, unlike maximum entropy IRL, which fails to locate the target.
When benchmarked against human image ratings, probabilistic temporal ranking performs particularly well, with high levels of agreement, and better than approaches such as maximum entropy inverse reinforcement learning and T-REX, which struggle with high-dimensional, exploratory demonstrations and limited training data.
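One simple way to measure this kind of agreement is sketched below, assuming arrays of human ratings and learned reward values for the same set of ultrasound images; the function name and the tie-handling rule are illustrative assumptions, not the paper's evaluation protocol.

```python
# Sketch: fraction of image pairs ranked the same way by humans and the reward model.
from itertools import combinations
import numpy as np

def pairwise_agreement(human_scores, model_rewards):
    agree, total = 0, 0
    for i, j in combinations(range(len(human_scores)), 2):
        if human_scores[i] == human_scores[j]:
            continue  # skip pairs the human rated as equal
        agree += (human_scores[i] > human_scores[j]) == (model_rewards[i] > model_rewards[j])
        total += 1
    return agree / total

# Example: three images rated by a human and scored by the reward model
print(pairwise_agreement(np.array([1, 3, 2]), np.array([0.1, 0.9, 0.4])))  # -> 1.0
```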