Research

SemanticPaint: Interactive Segmentation and Learning of 3D Worlds

We present a real-time, interactive system for the geometric reconstruction and object-class segmentation of 3D worlds. Using our system, a user can walk into a room wearing a consumer depth camera and a virtual reality headset, and both reconstruct the 3D scene (using hashing-based large-scale fusion) and interactively segment it on the fly into object classes such as 'chair', 'floor' and 'table'. The user interacts physically with the scene in the real world, touching or pointing to objects and using voice commands to assign them appropriate labels. The labels provide supervision for an online random forest that is used to predict labels for previously unseen parts of the scene. 
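As a rough, hypothetical sketch of the online learning component, the Python snippet below shows how user-supplied voxel labels could stream into a forest whose leaves maintain class histograms and split once they have seen enough samples. The feature vectors, the extremely randomised split choice and the OnlineTree/OnlineForest names are assumptions for illustration, not the system's actual implementation.

```python
import numpy as np

# Minimal sketch of an online random forest for streaming voxel labels.
# Each tree uses extremely randomised (feature, threshold) splits and keeps
# class histograms at its leaves, updated as labelled voxels arrive.

class _Node:
    def __init__(self, n_classes, depth):
        self.depth = depth
        self.hist = np.zeros(n_classes)   # class counts seen at this node
        self.split = None                 # (feature_index, threshold) once grown
        self.left = self.right = None

class OnlineTree:
    def __init__(self, n_features, n_classes, max_depth=10, split_after=50):
        self.n_features, self.n_classes = n_features, n_classes
        self.max_depth, self.split_after = max_depth, split_after
        self.root = _Node(n_classes, depth=0)

    def update(self, x, y):
        node = self.root
        while node.split is not None:                 # descend to a leaf
            f, t = node.split
            node = node.left if x[f] < t else node.right
        node.hist[y] += 1
        # Grow the leaf once it has accumulated enough labelled samples.
        if node.hist.sum() >= self.split_after and node.depth < self.max_depth:
            f = np.random.randint(self.n_features)
            node.split = (f, np.random.uniform(x[f] - 1.0, x[f] + 1.0))  # arbitrary sketch threshold
            node.left = _Node(self.n_classes, node.depth + 1)
            node.right = _Node(self.n_classes, node.depth + 1)
            node.left.hist = node.hist / 2.0   # warm-start children with the parent's counts
            node.right.hist = node.hist / 2.0

    def predict(self, x):
        node = self.root
        while node.split is not None:
            f, t = node.split
            node = node.left if x[f] < t else node.right
        return node.hist / max(node.hist.sum(), 1.0)

class OnlineForest:
    def __init__(self, n_trees, n_features, n_classes):
        self.trees = [OnlineTree(n_features, n_classes) for _ in range(n_trees)]

    def update(self, x, y):
        for tree in self.trees:
            tree.update(x, y)             # online bagging could be added here

    def predict(self, x):
        return np.mean([t.predict(x) for t in self.trees], axis=0)
```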

S. Golodetz, M. Sapienza, J. Valentin, V. Vineet, M. Cheng, V. Prisacariu, O. Kaehler, C. Y. Ren, A. Arnab, S. Hicks, D. W. Murray, S. Izadi, P. H. S. Torr, SemanticPaint: Interactive Segmentation and Learning of 3D Worlds, Proceedings of ACM SIGGRAPH 2015 Emerging Technologies, 2015. (live demo)


 

Audiovisual Semantic Segmentation

In this project, we use sound as an additional sensory modality to perform semantic segmentation. We obtain auditory information by tapping objects in the scene, and find that the sound emitted is characteristic of the material the object is made of. Using this information, we are able to better discriminate between object classes which appear visually similar but are made of different materials. The sound information that we collected (and release publicly) is only available at sparse locations in the scene. Our CRF model is able to effectively augment dense visual cues with this sparse auditory information.
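A toy sketch of the fusion idea, assuming per-pixel visual label costs, audio-derived label costs available only at the tapped pixels, a simple Potts pairwise term and ICM inference; the paper's joint object-material CRF and its inference are richer than this.

```python
import numpy as np

# Sketch: audio evidence enters as extra unary costs at the sparse tap
# locations, while visual unaries are dense; a few ICM sweeps then smooth
# the labelling with a Potts pairwise term on a 4-connected grid.

def segment(visual_unary, audio_unary, audio_mask, n_iters=5, pairwise_weight=1.0):
    """visual_unary: H x W x L per-pixel label costs from a visual classifier.
       audio_unary:  H x W x L label costs derived from tap sounds, valid only
                     where audio_mask (H x W, bool) is True."""
    H, W, L = visual_unary.shape
    unary = visual_unary.copy()
    unary[audio_mask] += audio_unary[audio_mask]   # fuse sparse audio evidence
    labels = unary.argmin(axis=2)                  # initialise from the unaries

    for _ in range(n_iters):                       # ICM sweeps
        for i in range(H):
            for j in range(W):
                costs = unary[i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        # Potts term: penalise disagreeing with each neighbour.
                        costs += pairwise_weight * (np.arange(L) != labels[ni, nj])
                labels[i, j] = costs.argmin()
    return labels
```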


Anurag Arnab, Michael Sapienza, Stuart Golodetz, Julien Valentin, Ondrej Miksik, Shahram Izadi, Philip H.S. Torr
Joint Object-Material Category Segmentation from Audio-Visual Cues, British Machine Vision Conference, 2015.



Recognising and localising human actions with parts 


We adapt pictorial star models to action datasets without location annotation, and to features invariant to changes in pose, such as bag-of-features and Fisher vectors, rather than low-level HOG. Thus, we propose a local deformable spatial bag-of-features model in which local discriminative regions are split into a fixed grid of parts that are allowed to deform in both space and time at test time.
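A minimal sketch of how such a deformable grid of parts could be scored at test time, assuming precomputed appearance scores for each part over a local search window and a quadratic deformation penalty; the weights, window sizes and feature encodings are placeholders rather than the trained model.

```python
import numpy as np

# Sketch of scoring a deformable grid of parts in the spirit of a pictorial
# star model over bag-of-features cells: each part may shift in space and time,
# paying a quadratic deformation cost for moving away from its anchor.

def score_deformable_parts(part_scores, anchors, deform_weight=0.1):
    """part_scores: list of 3-D arrays of appearance scores over (dy, dx, dt) offsets.
       anchors:     index of the zero-displacement offset in each array."""
    total = 0.0
    placements = []
    for scores, anchor in zip(part_scores, anchors):
        offsets = np.indices(scores.shape).reshape(3, -1).T - np.asarray(anchor)
        deform = deform_weight * (offsets ** 2).sum(axis=1)   # quadratic penalty
        combined = scores.ravel() - deform                    # appearance minus deformation
        best = combined.argmax()
        total += combined[best]
        placements.append(tuple(offsets[best]))               # chosen (dy, dx, dt) per part
    return total, placements
```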

Michael Sapienza, Fabio Cuzzolin, Philip H.S. Torr, Learning discriminative space-time action parts from weakly labelled videos, International Journal of Computer Vision, 2013.

[pdf - BibTeX]




Recognising and localising human actions


We propose a novel approach to action clip classification and localisation based on the recognition of space-time subvolumes. In a training step, discriminative action subvolumes are learned in a weakly supervised setting, owing to the difficulty of manually labelling massive video datasets. The learned models are used to simultaneously classify video clips and to localise actions to a given space-time subvolume.
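A small sketch of the test-time idea under these assumptions: candidate space-time subvolumes are scored with a linear classifier over their feature histograms, the maximum score classifies the clip, and the maximising subvolume gives the localisation. Candidate generation and the learned weights are stand-ins for the real system.

```python
import numpy as np

# Sketch: the clip-level score is the best subvolume score, and the best
# subvolume doubles as the action localisation.

def classify_and_localise(subvolume_histograms, subvolume_bounds, w, b):
    """subvolume_histograms: N x D features, one row per candidate subvolume.
       subvolume_bounds:     list of (x0, y0, t0, x1, y1, t1) boxes."""
    scores = subvolume_histograms @ w + b      # linear SVM-style decision values
    best = int(np.argmax(scores))
    clip_score = float(scores[best])           # clip-level classification score
    return clip_score, subvolume_bounds[best]  # classification + localisation
```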


Michael Sapienza, Fabio Cuzzolin, Philip H.S. Torr, 
Learning discriminative space-time actions from weakly labelled videos, British Machine Vision Conference, 2012. (oral) 



Vision for autonomous mobile robot guidance

This work has focused on a vision-based autonomous guidance algorithm for a mobile robot equipped with a low-quality monocular camera, operating in a previously unknown environment. A novel probabilistic framework has been developed for the classification of traversable image regions using multiple image features and a truncated exponential mixture model.
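A minimal sketch of how a truncated exponential model could turn a per-pixel feature distance into a traversability posterior, assuming one component concentrated near zero distance (traversable ground) and one near the truncation limit (obstacles); the rates, prior and distance measure are illustrative, and the actual model mixes multiple image features.

```python
import numpy as np

# Illustrative two-component traversability posterior built from truncated
# exponential densities over a non-negative feature distance d in [0, d_max].

def truncated_exp_pdf(d, rate, d_max):
    # Exponential density restricted to [0, d_max] and renormalised.
    return rate * np.exp(-rate * d) / (1.0 - np.exp(-rate * d_max))

def traversability_posterior(d, rate_trav, rate_obst, d_max, prior_trav=0.5):
    """d: per-pixel distances of image features to a reference 'ground' model;
       traversable pixels are assumed to have small d, obstacles large d."""
    lik_trav = truncated_exp_pdf(d, rate_trav, d_max)           # mass near d = 0
    lik_obst = truncated_exp_pdf(d_max - d, rate_obst, d_max)   # mass near d = d_max
    post = prior_trav * lik_trav
    return post / (post + (1.0 - prior_trav) * lik_obst)
```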

This classification framework runs in real-time, and the parameters that drive the traversability segmentation are learned in an on-line, semi-supervised fashion. The vision system has been implemented on the VISAR01 mobile robot and allows it to autonomously guide itself past static or dynamic obstacles in both indoor and outdoor natural environments in a real-time, reactive manner.
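A hedged sketch of what an on-line, semi-supervised parameter update could look like: pixels labelled with high confidence by the current model act as pseudo-labels and refresh the component rates with a moving average. The 1/mean rule is the plain (untruncated) exponential estimate, used here purely for illustration.

```python
import numpy as np

# Sketch of an on-line, semi-supervised update: confidently classified pixels
# provide pseudo-labels, and the component rates follow an exponential moving
# average of the (approximate) maximum-likelihood estimates.

class OnlineTraversabilityModel:
    def __init__(self, rate_trav=1.0, rate_obst=0.2, momentum=0.95):
        self.rate_trav, self.rate_obst = rate_trav, rate_obst
        self.momentum = momentum

    def update(self, d, posterior, hi=0.9, lo=0.1):
        trav = d[posterior > hi]   # confidently traversable pixels
        obst = d[posterior < lo]   # confidently non-traversable pixels
        if trav.size:
            self.rate_trav = (self.momentum * self.rate_trav
                              + (1 - self.momentum) / max(trav.mean(), 1e-6))
        if obst.size:
            self.rate_obst = (self.momentum * self.rate_obst
                              + (1 - self.momentum) / max(obst.mean(), 1e-6))
```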

Michael Sapienza, Kenneth P. Camilleri, A generative traversability model for monocular robot self-guidance,
9th Int. Conf. on Informatics in Control, Automation and Robotics, 2012. (oral) 

Media coverage: Times of Malta

[pdf - dissertation - slides - code - dataset - BibTeX]




Real-time 3D reconstruction and fixation with an active binocular head


In order for a binocular head to perform optimal 3D tracking, it should be able to verge its cameras actively, while maintaining geometric calibration. In this work we introduce a calibration update procedure, which allows a robotic head to simultaneously fixate, track, and reconstruct a moving object in real-time.  

The update method is based on a mapping from motor-based to image-based estimates of the camera orientations, which is learned in an offline stage. A fast online procedure then uses this mapping to update the calibration of the active binocular camera pair.
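A minimal sketch of the offline/online split under simplifying assumptions: a polynomial map from motor encoder readings to image-based orientation estimates is fitted offline, and at run-time the predicted pan angle refreshes the extrinsic rotation. The single pan angle, polynomial degree and rig rotation R_rig are illustrative, not the paper's exact parameterisation.

```python
import numpy as np

# Offline: fit a simple map from motor angles to image-based angle estimates.
# Online: predict the current pan angle from the motor reading and use it to
# refresh the camera's extrinsic rotation as the cameras verge.

def fit_motor_to_image_map(motor_angles, image_angles, degree=2):
    # Offline stage: least-squares polynomial fit (angles in radians).
    return np.polyfit(motor_angles, image_angles, degree)

def rotation_about_y(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def update_extrinsics(poly, motor_angle, R_rig):
    # Online stage: map the motor reading to an image-based pan angle and
    # compose it with a fixed rig-to-camera rotation.
    predicted_angle = np.polyval(poly, motor_angle)
    return rotation_about_y(predicted_angle) @ R_rig
```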

Michael Sapienza, Miles Hansard, Radu Horaud, Real-time visuomotor update of an active binocular head,
Autonomous Robots, 34(1):35–45, 2013.

Media coverage: RoboHub

[pdf - BibTeX]



Real-time head pose estimation in six degrees of freedom

This work aims to track a human head and estimate its orientation in six degrees of freedom, using only an uncalibrated, monocular web camera. The principal processing steps include face and facial-feature detection, in order to start automatically; tracking of the eye, nose and mouth regions using template matching; and estimation of the 3D vector normal to the facial plane from the positions of the features in the face.
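An illustrative sketch of the facial-normal step under a weak-perspective assumption: the nose base is placed on the symmetry axis between the eye midpoint and the mouth, and the image-plane offset of the nose tip from that base, together with an assumed nose-length ratio, gives the tilt of the facial plane. The ratios below are made up for illustration, not calibrated values.

```python
import numpy as np

# Sketch of recovering the facial-plane normal from tracked 2-D feature
# positions under a weak-perspective model. The in-plane offset of the nose
# tip supplies the x/y components of the normal; the foreshortening of the
# projected nose length supplies the out-of-plane component.

def facial_normal(left_eye, right_eye, mouth, nose_tip,
                  base_ratio=0.6, nose_length_ratio=0.5):
    left_eye, right_eye, mouth, nose_tip = map(
        np.asarray, (left_eye, right_eye, mouth, nose_tip))
    eye_mid = 0.5 * (left_eye + right_eye)
    face_len = np.linalg.norm(mouth - eye_mid)            # symmetry-axis length in the image
    nose_base = eye_mid + base_ratio * (mouth - eye_mid)  # assumed position on the axis
    offset = nose_tip - nose_base                         # image-plane displacement of the tip
    nose_len = nose_length_ratio * face_len               # assumed projected nose length, frontal
    z = np.sqrt(max(nose_len**2 - offset @ offset, 0.0))  # shrinks as the face tilts
    n = np.array([offset[0], offset[1], z])
    return n / np.linalg.norm(n)                          # unit normal to the facial plane
```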

This non-intrusive system runs in real time, starts automatically, and recovers from failure automatically, without any prior knowledge of the user's appearance or location.

M. Sapienza, K. P. Camilleri, Dept. of Systems & Control, University of Malta, 2011.

[pdf - dissertation - code - BibTeX]