SemanticPaint: Interactive Segmentation and Learning of 3D Worlds 

We present a real-time, interactive system for the geometric reconstruction and object-class segmentation of 3D worlds. Using our system, a user can walk into a room wearing a consumer depth camera and a virtual reality headset, and both reconstruct the 3D scene (using hashing-based large-scale fusion) and interactively segment it on the fly into object classes such as 'chair', 'floor' and 'table'. The user interacts physically with the scene in the real world, touching or pointing to objects and using voice commands to assign them appropriate labels. The labels provide supervision for an online random forest that is used to predict labels for previously unseen parts of the scene. The entire pipeline runs in real time, and the user stays 'in the loop' throughout the process, receiving immediate feedback about the progress of the labelling and interacting with the scene as necessary to refine the predicted segmentation.

Stuart Golodetz, Michael Sapienza, Julien Valentin, Vibhav Vineet, Ming-Ming Cheng, Victor Adrian Prisacariu, Olaf Kaehler, Carl Yuheng Ren, Anurag Arnab, Stephen Hicks, David W. Murray, Shahram Izadi, Philip H.S. Torr, SemanticPaint: Interactive Segmentation and Learning of 3D Worlds, Proceedings of ACM SIGGRAPH 2015 Emerging Technologies, 2015. (live demo) [pdf - project page - BibTex]

Recognising and localising human actions with parts 

We explore adding spatio-temporal structure to video representations, motivated by the success of deformable part models in object recognition. However, whereas objects have clear boundaries, which makes it easy to define a ground truth for initialisation, 3D space-time actions are inherently ambiguous and expensive to annotate in large datasets.

It is therefore desirable to adapt pictorial star models to action datasets without location annotation, and to features that are invariant to changes in pose, such as bag-of-features and Fisher vectors, rather than low-level HoG. To this end, we propose local deformable spatial bag-of-features, in which local discriminative regions are split into a fixed grid of parts that are allowed to deform in both space and time at test time.
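The deformable-parts idea can be illustrated with a minimal sketch in the style of a generic pictorial star model: each part independently picks the space-time offset that best trades appearance score against a deformation cost. The quadratic penalty, the function names and the toy numbers are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

Offset = Tuple[int, int, int]          # hypothetical (dx, dy, dt) displacement
Candidate = Tuple[Offset, float]       # offset and appearance score at that offset


def best_part_placement(candidates: List[Candidate], penalty: float) -> Tuple[Offset, float]:
    """Pick the offset maximising appearance score minus a quadratic deformation cost."""
    def total(candidate: Candidate) -> float:
        (dx, dy, dt), appearance = candidate
        return appearance - penalty * (dx * dx + dy * dy + dt * dt)

    best = max(candidates, key=total)
    return best[0], total(best)


def region_score(root_score: float, parts: List[List[Candidate]], penalty: float):
    """Score a region as its root score plus the best deformed score of every part."""
    placements = [best_part_placement(c, penalty) for c in parts]
    score = root_score + sum(s for _, s in placements)
    offsets = [o for o, _ in placements]
    return score, offsets


# Toy example: two parts, each with a few candidate space-time offsets.
parts = [
    [((0, 0, 0), 0.2), ((1, 0, 0), 0.5), ((2, 0, 1), 0.8)],
    [((0, 0, 0), 0.4), ((0, 1, 0), 0.45)],
]
score, offsets = region_score(root_score=1.0, parts=parts, penalty=0.1)
```

In this toy case the first part shifts by one cell because the gain in appearance outweighs the penalty, while the second part stays at its anchor position.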

Michael Sapienza, Fabio Cuzzolin, Philip H.S. Torr, Learning discriminative space-time action parts from weakly labelled videos, International Journal of Computer Vision, 2013. [pdf - video - BibTex]

Recognising and localising human actions

We propose a novel approach to action clip classification and localisation based on the recognition of spacetime subvolumes. In a training step, discriminative action subvolumes are learned in a weakly supervised setting, owing to the difficulty of manually labelling massive video datasets. 

The learned models are used to simultaneously classify video clips and to localise actions to a given space-time subvolume. The experimental results demonstrate that our MIL-BoF method performs comparably to, or better than, the BoF baseline on the most challenging video datasets.
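A simplified way to picture joint classification and localisation is to score every candidate subvolume of a clip and let the highest-scoring one determine both the clip score and the action location. The sketch below assumes a linear scoring function and a max rule over subvolumes; these choices, the names and the toy features are illustrative assumptions rather than the exact formulation in the paper.

```python
from typing import List, Sequence, Tuple


def linear_score(weights: Sequence[float], bias: float, feature: Sequence[float]) -> float:
    """Score one subvolume descriptor with a (hypothetical) linear model."""
    return sum(w * x for w, x in zip(weights, feature)) + bias


def classify_and_localise(
    weights: Sequence[float], bias: float, subvolume_features: List[Sequence[float]]
) -> Tuple[float, int]:
    """Return the clip score (max over subvolumes) and the index of the best subvolume."""
    scores = [linear_score(weights, bias, f) for f in subvolume_features]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return scores[best_index], best_index


# Toy clip with three candidate subvolumes described by 2-D features.
weights = [1.0, -0.5]
bias = -0.2
clip = [[0.1, 0.9], [0.8, 0.1], [0.3, 0.3]]
clip_score, best_subvolume = classify_and_localise(weights, bias, clip)
is_action = clip_score > 0.0  # positive if any subvolume looks like the action
```

Under the weak-supervision setting described above, only the clip-level label is needed at training time; the index of the winning subvolume provides the localisation.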

Michael Sapienza, Fabio Cuzzolin, Philip H.S. Torr, Learning discriminative space-time actions from weakly labelled videos, British Machine Vision Conference, 2012. (oral) [pdf - slides - talk - video - BibTex]

Vision for autonomous mobile robot guidance

This work focuses on a vision-based autonomous guidance algorithm for a mobile robot that is equipped with a low-quality monocular camera and operates in a previously unknown environment. A novel probabilistic framework has been developed for the classification of traversable image regions using multiple image features and a truncated exponential mixture model.

This classification framework runs in real time, and the parameters that drive the traversability segmentation are learned in an on-line, semi-supervised fashion. The vision system has been implemented on the VISAR01 mobile robot and allows it to autonomously guide itself past static or dynamic obstacles in both indoor and outdoor natural environments in a real-time, reactive manner.
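To make the truncated exponential mixture concrete, the following sketch evaluates such a mixture on a bounded interval and uses it as a likelihood in a two-class Bayes rule. It assumes that each image region is summarised by a dissimilarity value in [0, upper] relative to a reference ground region, and that non-traversable regions follow a uniform density; both assumptions, together with all names and parameter values, are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def truncated_exp_pdf(x: float, rate: float, upper: float) -> float:
    """Density of an exponential distribution with the given rate, truncated to [0, upper]."""
    if x < 0.0 or x > upper:
        return 0.0
    return rate * math.exp(-rate * x) / (1.0 - math.exp(-rate * upper))


def mixture_pdf(x: float, weights: Sequence[float], rates: Sequence[float], upper: float) -> float:
    """Weighted sum of truncated exponential components (weights should sum to 1)."""
    return sum(w * truncated_exp_pdf(x, r, upper) for w, r in zip(weights, rates))


def traversable_posterior(x, prior, weights, rates, upper) -> float:
    """Posterior probability of 'traversable', assuming a uniform non-traversable density."""
    like_trav = mixture_pdf(x, weights, rates, upper)
    like_non = 1.0 / upper if 0.0 <= x <= upper else 0.0
    evidence = prior * like_trav + (1.0 - prior) * like_non
    return prior * like_trav / evidence if evidence > 0.0 else 0.0


# Hypothetical parameters: two components, dissimilarities normalised to [0, 1].
weights, rates, upper = [0.6, 0.4], [3.0, 12.0], 1.0
p_similar = traversable_posterior(0.05, 0.5, weights, rates, upper)
p_different = traversable_posterior(0.90, 0.5, weights, rates, upper)
```

With these made-up parameters, a region that closely resembles the reference receives a high traversability posterior, and a dissimilar region receives a low one.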

Michael Sapienza, Kenneth P. Camilleri, A generative traversability model for monocular robot self-guidance,
9th Int. Conf. on Informatics in Control, Automation and Robotics, 2012. (oral) [pdf - slides - BibTex]

Real-time 3D reconstruction and fixation with an active binocular head

In order for a binocular head to perform optimal 3D tracking, it should be able to verge its cameras actively, while maintaining geometric calibration. In this work we introduce a calibration update procedure, which allows a robotic head to simultaneously fixate, track, and reconstruct a moving object in real-time.  

The update method is based on a mapping from motor-based to image-based estimates of the camera orientations, estimated in an offline stage. Following this, a fast online procedure is presented to update the calibration of an active binocular camera pair.

The proposed approach is ideal for active vision applications because no image processing is needed at runtime, either to calibrate the system or to maintain the calibration parameters during camera vergence. We show that this homography-based technique allows an active binocular robot to fixate and track an object, whilst performing 3D reconstruction concurrently in real-time.
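As background for the homography-based update, the sketch below shows the standard homography induced by a pure camera rotation, H = K R K^-1, which maps pixels from the image before the rotation to the image after it. It assumes a pinhole camera with square pixels that rotates about its optical centre; the intrinsic values and function names are hypothetical, and this is not the calibration procedure of the paper itself.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def rotation_about_y(angle):
    """Rotation matrix for a pan of `angle` radians about the camera's vertical axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rotation_homography(focal, cx, cy, rotation):
    """Homography K R K^-1 for a pinhole camera with focal length and principal point."""
    k = [[focal, 0.0, cx], [0.0, focal, cy], [0.0, 0.0, 1.0]]
    k_inv = [[1.0 / focal, 0.0, -cx / focal], [0.0, 1.0 / focal, -cy / focal], [0.0, 0.0, 1.0]]
    return matmul3(matmul3(k, rotation), k_inv)


def warp_point(h, u, v):
    """Apply a homography to pixel (u, v) and convert back from homogeneous coordinates."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return x / w, y / w


# Hypothetical intrinsics and a 0.1 rad pan read from the motor encoders.
H = rotation_homography(focal=500.0, cx=320.0, cy=240.0, rotation=rotation_about_y(0.1))
u_new, v_new = warp_point(H, 320.0, 240.0)  # where the old principal point lands
```

Because the homography depends only on the intrinsics and the rotation, it can be evaluated from motor readings alone, which is consistent with the claim that no image processing is required at runtime.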

Michael Sapienza, Miles Hansard, Radu Horaud, Real-time visuomotor update of an active binocular head, 
Autonomous Robots, 34(1):35–45, 2013. [pdf - video - BibTex]

Real-time head pose estimation in the six degrees of freedom

This work aims to track a human head and estimate its pose in all six degrees of freedom, which is a fundamental step towards estimating a person's gaze direction. The only sensor is an uncalibrated monocular web camera, which keeps the user completely free of any devices or wires.

The system is designed around a feature-based geometric technique which uses correspondences between the eyes, nose and mouth to estimate the head pose. The principal processing steps are: face and facial feature detection, so that the system can start automatically; tracking of the eye, nose and mouth regions using template matching; and estimation of the 3D vector normal to the facial plane from the positions of the features in the face.
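The template-matching step can be sketched as an exhaustive search for the image patch that best matches a stored feature template. The example below uses normalised cross-correlation as the similarity measure, which is an assumption: the text does not state which measure the system uses, and the tiny image and template are made up for illustration.

```python
import math
from typing import List, Tuple

Image = List[List[float]]


def ncc(patch: Image, template: Image) -> float:
    """Normalised cross-correlation between two equally sized patches, in [-1, 1]."""
    p = [v for row in patch for v in row]
    t = [v for row in template for v in row]
    mean_p, mean_t = sum(p) / len(p), sum(t) / len(t)
    num = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t))
    den = math.sqrt(sum((a - mean_p) ** 2 for a in p) * sum((b - mean_t) ** 2 for b in t))
    return num / den if den > 0.0 else 0.0  # flat patches carry no information


def match_template(image: Image, template: Image) -> Tuple[Tuple[int, int], float]:
    """Slide the template over the image; return the best (row, col) and its score."""
    th, tw = len(template), len(template[0])
    best_pos, best_score = (0, 0), -2.0
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(patch, template)
            if score > best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score


# Toy grey-level image containing the template pattern at row 1, column 2.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 1, 0],
    [0, 0, 1, 9, 0],
    [0, 0, 0, 0, 0],
]
template = [[9, 1], [1, 9]]
position, score = match_template(image, template)
```

In a tracker, the search would typically be restricted to a small window around the feature's previous position, and a low best score could be used to trigger re-detection.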

This non-intrusive system runs in real-time, starts automatically, and recovers from failure automatically, without any previous knowledge of the user appearance or location.

M. Sapienza, K.P. Camilleri, Dept. Systems & Control, Uni. Malta, 2011. [video - BibTex]

[Dissertation - Technical Report - Fasthpe Code]