SemanticPaint: Interactive Segmentation and Learning of 3D Worlds
We present a real-time, interactive system for the geometric
reconstruction and object-class segmentation of 3D worlds. Using our
system, a user can walk into a room wearing a consumer depth camera and a
virtual reality headset, and both reconstruct the 3D scene (using
hashing-based large-scale fusion) and interactively segment it on the
fly into object classes such as 'chair', 'floor' and 'table'. The user
interacts physically with the scene in the real world, touching or
pointing to objects and using voice commands to assign them appropriate
labels. The labels provide supervision for an online random forest that
is used to predict labels for previously unseen parts of the scene.
Audiovisual Semantic Segmentation
In this project, we use sound as an additional sensory modality to perform semantic segmentation. We obtain auditory information by tapping objects in the scene, and find that the sound emitted is characteristic of the material the object is made of. Using this information, we can better discriminate between object classes that appear visually similar but are made of different materials. The sound data that we collected (and release publicly) is available only at sparse locations in the scene. Our CRF model effectively augments dense visual cues with this sparse auditory information.
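As a rough illustration of how sparse auditory evidence can sharpen dense visual predictions, the sketch below sums negative log-probabilities per scene element and adds an audio term only where a tap was recorded. It covers just the unary part of a CRF, with no pairwise smoothing. The function name `fuse_unaries`, the `audio_weight` parameter and the class labels are hypothetical, chosen for illustration rather than taken from the project's code.

```python
import math

EPS = 1e-9  # guards against log(0); an assumption of this sketch


def fuse_unaries(visual_probs, audio_probs, audio_weight=1.0):
    """Pick the lowest-energy label per element from visual and audio cues.

    visual_probs: list of dicts (label -> probability), one per scene element.
    audio_probs:  dict (element index -> dict label -> probability), holding
                  entries only for the few elements that were tapped.
    """
    labels = []
    for index, visual in enumerate(visual_probs):
        # Dense visual unary energy: negative log-probability per label.
        energy = {label: -math.log(max(p, EPS)) for label, p in visual.items()}
        # Sparse audio unary energy: added only where a tap was recorded.
        if index in audio_probs:
            for label in energy:
                p_audio = audio_probs[index].get(label, EPS)
                energy[label] -= audio_weight * math.log(max(p_audio, EPS))
        labels.append(min(energy, key=energy.get))
    return labels


# Two visually ambiguous elements; only the second one has been tapped.
visual = [{"wooden_table": 0.55, "plastic_table": 0.45},
          {"wooden_table": 0.55, "plastic_table": 0.45}]
audio = {1: {"wooden_table": 0.1, "plastic_table": 0.9}}
result = fuse_unaries(visual, audio)
```

In this toy input the untapped element keeps its visual label, while the tapped one switches to the label favoured by the sound. A full CRF would also propagate that evidence to neighbouring elements through pairwise terms.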
Recognising and localising human actions with parts
We adapt pictorial star models to action datasets without location annotation, and to features that are invariant to changes in pose, such as bag-of-features and Fisher vectors, rather than low-level HoG. We thus propose local deformable spatial bag-of-features, in which local discriminative regions are split into a fixed grid of parts that are allowed to deform in both space and time at test time.
Recognising and localising human actions
We propose a novel approach to action clip classification and localisation based on the recognition of spacetime subvolumes. In a training step, discriminative action subvolumes are learned in a weakly supervised setting, owing to the difficulty of manually labelling massive video datasets. The learned models are used to simultaneously classify video clips and to localise actions to a given space-time subvolume.
Michael Sapienza, Fabio Cuzzolin, Philip H.S. Torr, Learning discriminative space-time actions from weakly labelled videos, British Machine Vision Conference, 2012. (oral)
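To make the idea of simultaneous classification and localisation concrete, the sketch below assumes that each candidate space-time subvolume has already been scored by a per-class model. It labels the clip with the class of the highest-scoring subvolume and returns that subvolume as the localisation. Max-pooling over subvolumes is an assumption of this sketch, a common choice in weakly supervised settings, and not necessarily the inference rule used in the paper. The name `classify_and_localise` and the `(x, y, t, width, height, duration)` tuple layout are hypothetical.

```python
def classify_and_localise(subvolume_scores):
    """Return (action, subvolume, score) for the best-scoring subvolume.

    subvolume_scores: dict mapping an action label to a list of
    (subvolume, score) pairs, where subvolume is a hypothetical
    (x, y, t, width, height, duration) tuple.
    """
    best = None
    for action, candidates in subvolume_scores.items():
        for subvolume, score in candidates:
            # Keep the single highest-scoring subvolume across all classes.
            if best is None or score > best[2]:
                best = (action, subvolume, score)
    return best


scores = {
    "wave": [((0, 0, 0, 40, 40, 16), 0.2), ((20, 10, 8, 40, 40, 16), 1.4)],
    "run": [((0, 0, 0, 40, 40, 16), 0.7), ((20, 10, 8, 40, 40, 16), -0.3)],
}
action, subvolume, score = classify_and_localise(scores)
```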
Vision for autonomous mobile robot guidance
This work has focused on a vision-based autonomous guidance algorithm for a mobile robot that is equipped with a low-quality monocular camera and operates in a previously unknown environment. A novel probabilistic framework has been developed for the classification of traversable image regions using multiple image features and a truncated exponential mixture model.
This classification framework runs in real-time, and the parameters that drive the traversability segmentation are learned in an online, semi-supervised fashion. The vision system has been implemented on the VISAR01 mobile robot and allows it to guide itself autonomously past static or dynamic obstacles in both indoor and outdoor natural environments in a real-time, reactive manner.
Real-time 3D reconstruction and fixation with an active binocular head
In order for a binocular head to perform optimal 3D tracking, it should be able to verge its cameras actively, while maintaining geometric calibration. In this work we introduce a calibration update procedure, which allows a robotic head to simultaneously fixate, track, and reconstruct a moving object in real-time.
The update method is based on a mapping from motor-based to image-based estimates of the camera orientations, estimated in an offline stage. Following this, a fast online procedure is presented to update the calibration of an active binocular camera pair.
Autonomous Robots, 34(1):35–45, 2013.
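The two-stage structure described above (a mapping estimated offline, then a fast online update) can be sketched for a single rotation axis as follows. The sketch assumes the mapping from motor-based to image-based orientation estimates is linear and fits it by ordinary least squares. The text only says "a mapping", so the linear form and the names `fit_linear_map` and `predict_image_angle` are assumptions made for illustration.

```python
def fit_linear_map(motor_angles, image_angles):
    """Offline stage: least-squares fit of image_angle = gain * motor_angle + offset."""
    n = len(motor_angles)
    mean_motor = sum(motor_angles) / n
    mean_image = sum(image_angles) / n
    variance = sum((m - mean_motor) ** 2 for m in motor_angles)
    covariance = sum((m - mean_motor) * (i - mean_image)
                     for m, i in zip(motor_angles, image_angles))
    gain = covariance / variance
    offset = mean_image - gain * mean_motor
    return gain, offset


def predict_image_angle(motor_angle, gain, offset):
    """Online stage: cheap conversion of a motor reading to an image-based angle."""
    return gain * motor_angle + offset


# Hypothetical calibration pairs (degrees): motor readings vs. image estimates.
gain, offset = fit_linear_map([0.0, 10.0, 20.0], [1.0, 12.0, 23.0])
predicted = predict_image_angle(15.0, gain, offset)
```

Once the two coefficients are known, the online step is a single multiply-add per axis. That keeps this sketch cheap to run at frame rate, whatever form the real mapping takes.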
Real-time head pose estimation in the six degrees of freedom
This work aims to track a human head and estimate its orientation in six degrees of freedom. The only sensor is an uncalibrated, monocular web camera. The principal processing steps are face and facial feature detection, so that the system can start automatically; tracking of the eye, nose and mouth regions using template matching; and estimation of the 3D vector normal to the facial plane from the positions of the features in the face.
This non-intrusive system runs in real-time, starts automatically, and recovers from failure automatically, without any previous knowledge of the user appearance or location.
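The template-matching step used to track the eye, nose and mouth regions can be illustrated with a minimal exhaustive search. The sketch below slides a small template over a grey-level image stored as nested lists and returns the position with the lowest sum of squared differences (SSD). The SSD score, the full-image search and the name `match_template` are assumptions of this sketch; the dissertation may use a different similarity measure or restrict the search to a window around the previous position.

```python
def match_template(image, template):
    """Return ((row, col), ssd) of the best match of template inside image.

    image and template are 2D lists of grey-level values; the template must
    be no larger than the image.
    """
    image_h, image_w = len(image), len(image[0])
    templ_h, templ_w = len(template), len(template[0])
    best_pos, best_ssd = None, float("inf")
    for row in range(image_h - templ_h + 1):
        for col in range(image_w - templ_w + 1):
            # Sum of squared differences between the template and this patch.
            ssd = sum((image[row + dr][col + dc] - template[dr][dc]) ** 2
                      for dr in range(templ_h) for dc in range(templ_w))
            if ssd < best_ssd:
                best_pos, best_ssd = (row, col), ssd
    return best_pos, best_ssd


frame = [[0, 0, 0, 0],
         [0, 0, 9, 8],
         [0, 0, 7, 6],
         [0, 0, 0, 0]]
feature = [[9, 8],
           [7, 6]]
position, error = match_template(frame, feature)
```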
M. Sapienza, K.P. Camilleri, Dept. Systems & Control, Uni. Malta, 2011.