Straight To Shapes: Real-Time Detection of Encoded Shapes

Current object detection approaches predict bounding boxes, but these provide little instance-specific information beyond location, scale and aspect ratio. In this work, we propose to directly regress to objects' shapes in addition to their bounding boxes and categories. It is crucial to find an appropriate shape representation that is compact and decodable, and in which objects can be compared for higher-order concepts such as view similarity, pose variation and occlusion. To achieve this, we use a denoising convolutional auto-encoder to establish an embedding space, and place the decoder after a fast end-to-end network trained to regress directly to the encoded shape vectors. This yields what to the best of our knowledge is the first real-time shape prediction network, running at ~35 FPS on a high-end desktop. With higher-order shape reasoning well-integrated into the network pipeline, the network shows the useful practical quality of generalising to unseen categories that are similar to the ones in the training set, something that most existing approaches fail to handle.
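As a rough illustration of the test-time idea described above, the sketch below decodes a regressed shape vector into a small mask and stretches that mask over the predicted bounding box. It is not the network from the paper: `decode_shape` is a hypothetical stand-in (one linear layer plus a sigmoid) for the trained auto-encoder decoder, and `paste_mask`, the code length and the mask size are likewise assumptions made for the example.

```python
import math


def decode_shape(code, weights):
    """Hypothetical stand-in decoder: one linear layer followed by a sigmoid.

    `weights` holds one row per mask pixel; the result is a square grid of
    foreground probabilities. The real decoder is a trained convolutional
    network, which this does not attempt to reproduce.
    """
    probs = [1.0 / (1.0 + math.exp(-sum(w * c for w, c in zip(row, code))))
             for row in weights]
    side = math.isqrt(len(probs))
    return [probs[r * side:(r + 1) * side] for r in range(side)]


def paste_mask(mask, box, image_height, image_width, threshold=0.5):
    """Stretch a decoded mask over its box (nearest neighbour) and return a
    full-size binary mask. `box` is (x0, y0, x1, y1) in pixels and is assumed
    to lie inside the image with positive width and height."""
    x0, y0, x1, y1 = box
    box_h, box_w = y1 - y0, x1 - x0
    full = [[False] * image_width for _ in range(image_height)]
    for y in range(box_h):
        src_row = mask[y * len(mask) // box_h]
        for x in range(box_w):
            value = src_row[x * len(src_row) // box_w]
            full[y0 + y][x0 + x] = value >= threshold
    return full
```

Because the decoder only runs on a short vector per detection, the expensive part of such a pipeline remains the detection network itself, which is consistent with the real-time claim above.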

Saumya Jetley*, Michael Sapienza*, Stuart Golodetz and Philip H. S. Torr,
Straight to Shapes: Real-time Detection of Encoded Shapes,
CVPR, 2017.

[pdf - project page - code - BibTex]

Human Action Detection

In this work we propose an approach for the spatio-temporal localisation (detection) and classification of multiple concurrent actions within temporally untrimmed videos. We demonstrate the performance of our algorithm on three challenging datasets, achieving new state-of-the-art results across the board and significantly lower detection latency at test time.

Suman Saha, Gurkirt Singh, Michael Sapienza, Philip H.S. Torr, Fabio Cuzzolin,
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos,
British Machine Vision Conference, 2016.

[pdf - project page - code - BibTex]

SemanticPaint: Interactive Segmentation and Learning of 3D Worlds

We present a real-time, interactive system for the geometric reconstruction and object-class segmentation of 3D worlds. Using our system, a user can walk into a room wearing a consumer depth camera and a virtual reality headset, and both reconstruct the 3D scene (using hashing-based large-scale fusion) and interactively segment it on the fly into object classes such as 'chair', 'floor' and 'table'. The user interacts physically with the scene in the real world, touching or pointing to objects and using voice commands to assign them appropriate labels. The labels provide supervision for an online random forest that is used to predict labels for previously unseen parts of the scene. 

S. Golodetz, M. Sapienza, J. Valentin, V. Vineet, M. Cheng, V. Prisacariu, O. Kaehler, C. Y. Ren, A. Arnab, S. Hicks, D. W. Murray, S. Izadi, P. H.S. Torr,
SemanticPaint: Interactive Segmentation and Learning of 3D Worlds,
Proceedings of ACM SIGGRAPH 2015 Emerging Technologies, 2015. (live demo)


Audiovisual Semantic Segmentation

In this project, we use sound as an additional sensory modality to perform semantic segmentation. We obtain auditory information by tapping objects in the scene, and find that the sound emitted is characteristic of the material that the object is made of. Using this information, we are able to improve our discrimination of object classes that appear visually similar but are made of different materials. The sound information that we collected (and release publicly) is only available at sparse locations in the scene. Our CRF model is able to effectively augment dense visual cues with sparse auditory information.

Anurag Arnab, Michael Sapienza, Stuart Golodetz, Julien Valentin, Ondrej Miksik, Shahram Izadi, Philip H.S. Torr,
Joint Object-Material Category Segmentation from Audio-Visual Cues,
British Machine Vision Conference, 2015.

Recognising and localising human actions with parts 

We adapt pictorial star models to action datasets without location annotation, and to features invariant to changes in pose such as bag-of-features and Fisher vectors, rather than low-level HoG. Thus, we propose local deformable spatial bag-of-features, in which local discriminative regions are split into a fixed grid of parts that are allowed to deform in both space and time at test time.

Michael Sapienza, Fabio Cuzzolin, Philip H.S. Torr, Learning discriminative space-time action parts from weakly labelled videos, International Journal of Computer Vision, 2013.

Recognising and localising human actions

We propose a novel approach to action clip classification and localisation based on the recognition of space-time subvolumes. In a training step, discriminative action subvolumes are learned in a weakly supervised setting, owing to the difficulty of manually labelling massive video datasets. The learned models are used to simultaneously classify video clips and to localise actions to a given space-time subvolume.

Michael Sapienza, Fabio Cuzzolin, Philip H.S. Torr, 
Learning discriminative space-time actions from weakly labelled videos, British Machine Vision Conference, 2012. (oral) 

Vision for autonomous mobile robot guidance

This work has focused on a vision-based autonomous guidance algorithm for a mobile robot that is equipped with a low-quality monocular camera and operates in a previously unknown environment. A novel probabilistic framework has been developed for the classification of traversable image regions using multiple image features and a truncated exponential mixture model.
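For readers unfamiliar with the term, the sketch below shows what a truncated exponential mixture density looks like and how two such densities could be compared with Bayes' rule. The summary above does not say what the model is fitted to or how the decision is made, so the choice of a scalar feature `x` on `[0, upper]`, the two-class comparison and all function names are assumptions for illustration only.

```python
import math


def truncated_exponential_pdf(x, rate, upper):
    """Density of an exponential with the given rate, truncated to [0, upper]."""
    if x < 0.0 or x > upper:
        return 0.0
    return rate * math.exp(-rate * x) / (1.0 - math.exp(-rate * upper))


def mixture_pdf(x, weights, rates, upper):
    """Weighted sum of truncated exponentials; weights are assumed to sum to 1."""
    return sum(w * truncated_exponential_pdf(x, r, upper)
               for w, r in zip(weights, rates))


def traversable_posterior(x, ground, obstacle, upper, prior_ground=0.5):
    """Hypothetical two-class use: posterior probability that feature value x
    came from the 'ground' mixture rather than the 'obstacle' mixture.
    Each model is a (weights, rates) pair."""
    p_ground = prior_ground * mixture_pdf(x, ground[0], ground[1], upper)
    p_obstacle = (1.0 - prior_ground) * mixture_pdf(x, obstacle[0], obstacle[1], upper)
    total = p_ground + p_obstacle
    # Outside [0, upper] both densities are zero, so fall back to the prior.
    return prior_ground if total == 0.0 else p_ground / total
```

The truncation term in the denominator is what keeps each component a proper density on a bounded feature range, which matters when features such as normalised distances cannot exceed a fixed maximum.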

This classification framework runs in real-time, and the parameters that drive the traversability segmentation are learned in an on-line, semi-supervised fashion. The vision system has been implemented on the VISAR01 mobile robot and allows it to autonomously guide itself past static or dynamic obstacles in both indoor and outdoor natural environments in a real-time, reactive manner.

Michael Sapienza, Kenneth P. Camilleri, A generative traversability model for monocular robot self-guidance,
9th Int. Conf. on Informatics in Control, Automation and Robotics, 2012. (oral) 

Media coverage: Times of Malta

[pdf - dissertation - slides - code - dataset - BibTex]

Real-time 3D reconstruction and fixation with an active binocular head

In order for a binocular head to perform optimal 3D tracking, it should be able to verge its cameras actively, while maintaining geometric calibration. In this work we introduce a calibration update procedure, which allows a robotic head to simultaneously fixate, track, and reconstruct a moving object in real-time.  

The update method is based on a mapping from motor-based to image-based estimates of the camera orientations, estimated in an offline stage. Following this, a fast online procedure is presented to update the calibration of an active binocular camera pair.
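The offline/online split can be sketched as follows. The summary does not give the form of the motor-to-image mapping, so this example simply assumes an independent affine relation per rotation axis (`image_angle = gain * motor_angle + offset`) fitted by least squares; the function names are hypothetical.

```python
def fit_linear_map(motor_angles, image_angles):
    """Offline stage (assumed form): least-squares fit of
    image_angle = gain * motor_angle + offset from paired measurements."""
    n = len(motor_angles)
    mean_m = sum(motor_angles) / n
    mean_i = sum(image_angles) / n
    cov = sum((m - mean_m) * (i - mean_i)
              for m, i in zip(motor_angles, image_angles))
    var = sum((m - mean_m) ** 2 for m in motor_angles)
    gain = cov / var  # requires at least two distinct motor angles
    return gain, mean_i - gain * mean_m


def predict_image_angle(motor_angle, gain, offset):
    """Online stage: turn a motor reading into an image-based orientation
    estimate without re-running image-based calibration."""
    return gain * motor_angle + offset
```

Under this assumption the online update costs one multiplication and one addition per axis, which is the kind of budget a real-time fixation loop needs.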

Michael Sapienza, Miles Hansard, Radu Horaud, Real-time visuomotor update of an active binocular head,
Autonomous Robots, 34(1):35–45, 2013.

Media coverage: RoboHub

[pdf - BibTex]

Real-time head pose estimation in the six degrees of freedom

This work aims to track a human head and estimate its pose in six degrees of freedom, using an uncalibrated monocular web camera as the only sensor. The principal processing steps are face and facial feature detection (so that the system can start automatically), tracking of the eye, nose and mouth regions using template matching, and estimation of the 3D vector normal to the facial plane from the positions of the features in the face.
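The template-matching step can be illustrated with a minimal exhaustive search. The matching score used in the project is not stated here, so this sketch assumes a sum of squared differences (SSD) over greyscale intensities; `match_template` is a hypothetical name, and images are plain nested lists to keep the example dependency-free.

```python
def match_template(image, template):
    """Return the (row, col) of the top-left corner where the template best
    matches the image, scored by sum of squared differences (lower is better).

    Both arguments are 2D lists of greyscale values, and the template must be
    no larger than the image.
    """
    t_rows, t_cols = len(template), len(template[0])
    best_score, best_pos = None, None
    for r in range(len(image) - t_rows + 1):
        for c in range(len(image[0]) - t_cols + 1):
            score = sum((image[r + i][c + j] - template[i][j]) ** 2
                        for i in range(t_rows) for j in range(t_cols))
            if best_score is None or score < best_score:
                best_score, best_pos = score, (r, c)
    return best_pos
```

A tracker would normally restrict the search to a small window around the feature's previous position rather than scanning the whole frame, which is what makes this approach feasible in real time.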

This non-intrusive system runs in real-time, starts automatically, and recovers from failure automatically, without any prior knowledge of the user's appearance or location.

M. Sapienza, K.P. Camilleri, Dept. Systems & Control, Uni. Malta, 2011.

[pdf - dissertation - code - BibTex]