Research


Straight To Shapes: Real-Time Detection of Encoded Shapes

In this work, we propose to directly regress to objects' shapes in addition to their bounding boxes and categories. To do this, it is crucial to find a shape representation that is compact and decodable, and in which shapes/objects can be compared. We therefore use a denoising convolutional auto-encoder to establish an embedding space, and place the decoder after a fast end-to-end network trained to regress directly to the encoded shape vectors. This yields the first real-time shape prediction/instance segmentation network, running at ~35 FPS on a high-end desktop. Our network also has the practically useful ability to generalise to unseen categories that are similar to those in the training set, something that most existing approaches fail to handle.
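
As a rough illustration of the embedding idea only (the published architecture, layer sizes and training details differ), the sketch below shows a small denoising convolutional auto-encoder in PyTorch that maps binary masks to compact shape vectors and back; the detection network would then regress straight to such vectors and reuse the decoder at test time.

    import torch
    import torch.nn as nn

    class ShapeAutoEncoder(nn.Module):
        """Toy denoising auto-encoder: 64x64 binary mask <-> compact shape vector."""
        def __init__(self, embed_dim=20):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64x64 -> 32x32
                nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
                nn.Flatten(),
                nn.Linear(32 * 16 * 16, embed_dim),
            )
            self.decoder = nn.Sequential(
                nn.Linear(embed_dim, 32 * 16 * 16), nn.ReLU(),
                nn.Unflatten(1, (32, 16, 16)),
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, mask):
            code = self.encoder(mask)            # compact, decodable shape representation
            return self.decoder(code), code

    # Denoising training step: corrupt the input mask, reconstruct the clean one.
    model = ShapeAutoEncoder()
    masks = (torch.rand(8, 1, 64, 64) > 0.5).float()            # stand-in binary masks
    noisy = (masks + 0.1 * torch.randn_like(masks)).clamp(0, 1)
    recon, codes = model(noisy)
    loss = nn.functional.binary_cross_entropy(recon, masks)
    loss.backward()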

Laurynas Miksys, Saumya Jetley, Michael Sapienza, Stuart Golodetz, Philip H.S. Torr, Straight to Shapes++: Real-time Instance Segmentation Made More Accurate, Department of Engineering Science, University of Oxford, 2019. [pdf]

Saumya Jetley*, Michael Sapienza*, Stuart Golodetz, Philip H.S. Torr, Straight to Shapes: Real-time Detection of Encoded Shapes, Computer Vision and Pattern Recognition (CVPR), 2017. [pdf - poster - project page - video - code]


Human Action Detection

In this work we propose an approach for the spatio-temporal localisation (detection) and classification of multiple concurrent actions within temporally untrimmed videos. We demonstrate the performance of our algorithm on three challenging datasets, achieving new state-of-the-art results across the board and significantly lower detection latency at test time.
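
The paper's title refers to space-time action tubes; purely as an illustration (and not the paper's actual linking algorithm), the following Python sketch greedily links per-frame detection boxes into a single tube using detection scores and spatial overlap. All box and score values would come from a frame-level detector.

    import numpy as np

    def iou(a, b):
        """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
        x1, y1 = np.maximum(a[:2], b[:2])
        x2, y2 = np.minimum(a[2:], b[2:])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def link_tube(frame_detections):
        """frame_detections: per-frame lists of (box, score) pairs; returns one tube."""
        tube = []
        for dets in frame_detections:
            if not tube:
                tube.append(max(dets, key=lambda d: d[1]))
                continue
            prev_box = np.asarray(tube[-1][0], dtype=float)
            # prefer detections that score highly and overlap the previous frame's box
            tube.append(max(dets, key=lambda d: d[1] + iou(prev_box, np.asarray(d[0], dtype=float))))
        return tube

    detections = [[([10, 10, 50, 90], 0.9), ([200, 40, 260, 120], 0.6)],
                  [([12, 11, 52, 92], 0.8), ([205, 42, 262, 121], 0.7)]]
    tube = link_tube(detections)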

Suman Saha, Gurkirt Singh, Michael Sapienza, Philip H.S. Torr, Fabio Cuzzolin, Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos, British Machine Vision Conference, 2016. [pdf - project page - video - code]


SemanticPaint: Interactive Segmentation and Learning of 3D Worlds

We present a real-time, interactive system for the geometric reconstruction and object-class segmentation of 3D worlds. Using our system, a user can walk into a room wearing a consumer depth camera and a virtual reality headset, and both reconstruct the 3D scene and interactively segment it on the fly into object classes such as 'chair', 'floor' and 'table'. The user interacts physically with the scene in the real world, touching or pointing to objects and using voice commands to assign them appropriate labels. The labels provide supervision for an online random forest that is used to predict labels for previously unseen parts of the scene.
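
As a much-simplified sketch of this labelling loop (the real system uses a streaming, online forest and richer geometric/colour features; here an ordinary forest is simply refitted on the pool of labelled samples and the features are placeholders):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    def voxel_features(n):
        # placeholder per-voxel features (colour, height above floor, etc. in reality)
        return rng.normal(size=(n, 4))

    labelled_X, labelled_y = [], []
    forest = RandomForestClassifier(n_estimators=32)

    def on_user_touch(features, label):
        """Called when the user touches a surface and speaks a class name."""
        labelled_X.append(features)
        labelled_y.append(np.full(len(features), label))
        forest.fit(np.vstack(labelled_X), np.concatenate(labelled_y))

    # the user labels some voxels as 'floor' (0) and others as 'table' (1)
    on_user_touch(voxel_features(50), 0)
    on_user_touch(voxel_features(50), 1)

    # predict labels for previously unseen parts of the reconstructed scene
    unseen = voxel_features(1000)
    predicted_labels = forest.predict(unseen)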

Stuart Golodetz, Michael Sapienza, Julien Valentin, et al., SemanticPaint: Interactive Segmentation and Learning of 3D Worlds, Proceedings of ACM SIGGRAPH 2015 Emerging Technologies, 2015. (live demo) [pdf - project page - code - facebook]

Media coverage: BBC Click - BBC Click best bits - BBC - BBC Asia - Gizmodo - Engadget


Audiovisual Semantic Segmentation

In this project, we use sound as an additional sensory modality to perform semantic segmentation. We obtain auditory information by tapping objects in the scene, and find that the sound emitted is characteristic of the material that the object is made of. Using this information, we are able to better discriminate object classes that appear visually similar but are made of different materials. The sound information that we collected (and release publicly) is only available at sparse locations in the scene. Our CRF model is able to effectively augment dense visual cues with this sparse auditory information.
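
The sketch below is meant only to convey the fusion idea, not the paper's CRF: dense visual class scores are combined with audio-based material scores that exist solely at the tapped pixels, and a crude neighbourhood smoothing stands in for pairwise CRF inference. All scores and locations are toy values.

    import numpy as np
    from scipy.ndimage import uniform_filter

    H, W, C = 60, 80, 3                       # image grid and number of classes
    rng = np.random.default_rng(0)

    visual_unary = rng.random((H, W, C))      # dense visual class scores
    audio_unary = np.zeros((H, W, C))         # sparse audio/material scores
    tap_locations = [(10, 20), (40, 60)]      # pixels where objects were tapped
    for (y, x) in tap_locations:
        audio_unary[y, x] = np.array([0.1, 0.8, 0.1])   # e.g. "sounds like ceramic"

    # fuse in log space: audio evidence only contributes where a tap was recorded
    fused = np.log(visual_unary + 1e-8)
    has_audio = audio_unary.sum(axis=-1, keepdims=True) > 0
    fused = np.where(has_audio, fused + np.log(audio_unary + 1e-8), fused)

    # crude spatial smoothing as a stand-in for pairwise CRF inference
    smoothed = uniform_filter(fused, size=(5, 5, 1))
    labels = smoothed.argmax(axis=-1)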

Anurag Arnab, Michael Sapienza, Stuart Golodetz, Julien Valentin, Ondrej Miksik, Shahram Izadi, Philip H.S. Torr, Joint Object-Material Category Segmentation from Audio-Visual Cues, British Machine Vision Conference, 2015. [pdf - project page - dataset]



Recognising and localising human actions with parts

We adapt pictorial star models to action datasets without location annotation, and to features that are invariant to changes in pose, such as bag-of-features and Fisher vectors, rather than low-level HoG. Specifically, we propose a local deformable spatial bag-of-features model in which local discriminative regions are split into a fixed grid of parts that are allowed to deform in both space and time at test time.
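
For intuition, the toy sketch below scores such a star (root plus parts) model on a space-time grid, assuming precomputed per-location part responses; the quadratic deformation cost and all parameter values are illustrative rather than the paper's exact formulation.

    import numpy as np

    def score_star_model(root_score, part_scores, anchors, deform_weight=0.1, max_disp=2):
        """Score = root + sum over parts of the best displaced part response
        minus a quadratic space-time deformation penalty."""
        total = root_score
        T, Y, X = part_scores.shape[1:]               # space-time response volume per part
        for p, (t0, y0, x0) in enumerate(anchors):
            best = -np.inf
            for dt in range(-max_disp, max_disp + 1):
                for dy in range(-max_disp, max_disp + 1):
                    for dx in range(-max_disp, max_disp + 1):
                        t, y, x = t0 + dt, y0 + dy, x0 + dx
                        if 0 <= t < T and 0 <= y < Y and 0 <= x < X:
                            cost = deform_weight * (dt * dt + dy * dy + dx * dx)
                            best = max(best, part_scores[p, t, y, x] - cost)
            total += best
        return total

    rng = np.random.default_rng(0)
    part_scores = rng.normal(size=(4, 10, 8, 8))      # 4 parts over a 10x8x8 space-time grid
    anchors = [(5, 2, 2), (5, 2, 5), (5, 5, 2), (5, 5, 5)]
    clip_score = score_star_model(root_score=1.3, part_scores=part_scores, anchors=anchors)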

Michael Sapienza, Fabio Cuzzolin, Philip H.S. Torr, Learning discriminative space-time action parts from weakly labelled videos, International Journal of Computer Vision, 2014. [pdf - project page - video - BibTex]


Recognising and localising human actions

We propose a novel approach to action clip classification and localisation based on the recognition of space-time subvolumes. In a training step, discriminative action subvolumes are learned in a weakly supervised setting, owing to the difficulty of manually labelling massive video datasets. The learned models are then used to simultaneously classify video clips and localise actions to a given space-time subvolume.
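
As a minimal illustration, assuming a linear action model applied to per-subvolume descriptors (every value below is a toy placeholder), the best-scoring space-time subvolume both classifies the clip and localises the action:

    import numpy as np

    rng = np.random.default_rng(1)
    descriptors = rng.normal(size=(50, 128))         # one descriptor per candidate subvolume
    locations = rng.integers(0, 100, size=(50, 6))   # (t1, y1, x1, t2, y2, x2) per subvolume
    w, b = rng.normal(size=128), -0.2                # learned linear action model (toy values)

    scores = descriptors @ w + b
    best = int(scores.argmax())
    clip_contains_action = scores[best] > 0          # clip-level classification decision
    action_location = locations[best]                # space-time localisation of the action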

Michael Sapienza, Fabio Cuzzolin, Philip H.S. Torr, Learning discriminative space-time actions from weakly labelled videos, British Machine Vision Conference, 2012. (oral) [pdf - project page - slides - poster - talk - video - BibTex]



Vision for autonomous mobile robot guidance

This work focuses on a vision-based autonomous guidance algorithm for a mobile robot equipped with a low-quality monocular camera, operating in a previously unknown environment. We developed a novel probabilistic framework for the classification of traversable image regions using multiple image features and a truncated exponential mixture model. This classification framework runs in real time, and the parameters that drive the traversability segmentation are learned in an online, semi-supervised fashion.
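
A loose sketch of the per-pixel scoring, assuming truncated exponential likelihoods over feature distances to a reference ground patch; the feature choice, the uniform alternative model and the fixed parameters below are simplifications of what the paper describes (in particular, its parameters are updated online).

    import numpy as np

    def trunc_exp_pdf(x, lam):
        """Truncated exponential density on [0, 1]."""
        return lam * np.exp(-lam * x) / (1.0 - np.exp(-lam))

    rng = np.random.default_rng(0)
    H, W, F = 48, 64, 3                       # image grid and number of features

    # per-pixel distances of image features (e.g. colour, texture) to a reference
    # ground patch in front of the robot, normalised to [0, 1]
    feature_dists = rng.random((H, W, F))

    lams = np.array([4.0, 3.0, 5.0])          # per-feature rate parameters
    prior_traversable = 0.6

    # traversable regions should lie close to the ground patch in feature space;
    # combine features naive-Bayes style against a uniform "non-ground" model
    lik_ground = np.prod(trunc_exp_pdf(feature_dists, lams), axis=-1)
    lik_other = 1.0                           # uniform density on the unit cube
    posterior = (lik_ground * prior_traversable) / (
        lik_ground * prior_traversable + lik_other * (1.0 - prior_traversable))
    traversable_mask = posterior > 0.5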

Michael Sapienza, Kenneth P. Camilleri, A generative traversability model for monocular robot self-guidance, 9th Int. Conf. on Informatics in Control, Automation and Robotics, 2012. (oral) [pdf - code - dataset - slides - video - BibTex]

Media coverage: Times of Malta


Real-time 3D reconstruction and fixation with an active binocular head

In order for a binocular head to perform optimal 3D tracking, it should be able to verge its cameras actively while maintaining geometric calibration. In this work we introduce a calibration update procedure which allows a robotic head to simultaneously fixate, track, and reconstruct a moving object in real time.

The update method is based on a mapping from motor-based to image-based estimates of the camera orientations, which is learned in an offline stage. A fast online procedure then uses this mapping to update the calibration of the active binocular camera pair.
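
As a much-simplified sketch of the two stages, assuming a single vergence axis and a linear motor-to-image-angle map (both simplifying assumptions rather than the paper's model):

    import numpy as np

    # --- offline stage: collect (motor angle, image-based angle) pairs and fit a map
    rng = np.random.default_rng(0)
    motor_angles = np.linspace(-0.3, 0.3, 20)                 # radians, from the encoders
    image_angles = 0.97 * motor_angles + 0.01 + rng.normal(0, 1e-3, 20)  # from matched features
    slope, offset = np.polyfit(motor_angles, image_angles, deg=1)

    def rotation_about_y(theta):
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

    # --- online stage: update the camera rotation from a new motor reading alone
    def updated_rotation(motor_angle, R_calib):
        """Compose the offline-calibrated rotation with the predicted vergence rotation."""
        predicted_angle = slope * motor_angle + offset
        return rotation_about_y(predicted_angle) @ R_calib

    R_left = updated_rotation(0.12, np.eye(3))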

Michael Sapienza, Miles Hansard, Radu Horaud, Real-time visuomotor update of an active binocular head, Autonomous Robots, 34(1):35–45, 2013. [pdf - video - BibTex]

Media coverage: RoboHub


Real-time head pose estimation in six degrees of freedom

This work aims to track a human head and estimate its pose in six degrees of freedom, using only an uncalibrated, monocular web camera. The principal processing steps are: face and facial feature detection, which allow the system to start automatically; tracking of the eye, nose and mouth regions using template matching; and estimation of the 3D vector normal to the facial plane from the positions of these features within the face.

This non-intrusive system runs in real time, starts automatically, and recovers from tracking failures on its own, without any prior knowledge of the user's appearance or location.
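
As a bare-bones illustration of the template-matching step for a single facial region, using OpenCV; the full system also detects the face to initialise the templates and recovers the facial-plane normal from the tracked feature layout.

    import cv2
    import numpy as np

    rng = np.random.default_rng(0)
    frame = rng.integers(0, 255, size=(480, 640), dtype=np.uint8)  # stand-in for a webcam frame
    eye_template = frame[200:220, 300:330].copy()                  # template captured at start-up

    # normalised cross-correlation over the frame; the best match gives the new eye position
    response = cv2.matchTemplate(frame, eye_template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(response)
    if max_val > 0.7:                 # confidence gate; below this the tracker re-initialises
        eye_x, eye_y = max_loc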

Michael Sapienza, Kenneth P. Camilleri, Fasthpe: A recipe for quick head pose estimation, Department of Systems & Control, University of Malta, 2011. [pdf - code - video - BibTex]