We consider a scene with a static camera in an outdoor environment overlooking the approach route and entrance to a building.
In this setting, we propose a threat prediction method using machine vision algorithms, which process the video stream in real time to understand the observed action. The resulting method distinguishes normal activity from a possible threat using a score calculated from a combination of visual and semantic features extracted from the image sequence by deep neural networks. At the top level, the inference is based on a probabilistic model and a particle filter that simulates different scenarios given the observed data from the low-level deep network detections, tracking and classification.
Based on the predicted threat score and confidence, the camera system will either directly trigger the alarm or challenge the potential intruder with an identification request.
Threat Assessment System
UK Patent, 2023.
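For illustration, the following is a minimal sketch of the top-level particle filter, assuming the low-level networks supply per-frame likelihoods of the observed features under the "threat" and "normal" hypotheses; the model, probabilities and thresholds are illustrative placeholders, not those of the patented system.

    import numpy as np

    # Particles hypothesize the observed person's intent; frame likelihoods reweight them.
    rng = np.random.default_rng(0)
    N = 1000
    intent = rng.random(N) < 0.05            # particle state: True = threat scenario
    weights = np.full(N, 1.0 / N)

    def step(weights, intent, p_obs_threat, p_obs_normal, flip=0.01):
        switch = rng.random(N) < flip        # scenario transition noise
        intent = np.where(switch, ~intent, intent)
        weights = weights * np.where(intent, p_obs_threat, p_obs_normal)
        weights /= weights.sum()
        if 1.0 / np.sum(weights ** 2) < N / 2:           # resample on low ESS
            idx = rng.choice(N, size=N, p=weights)
            intent, weights = intent[idx], np.full(N, 1.0 / N)
        return weights, intent

    for p_t, p_n in [(0.2, 0.8), (0.6, 0.4), (0.9, 0.1)]:  # toy frame likelihoods
        weights, intent = step(weights, intent, p_t, p_n)
        print(f"threat score: {np.sum(weights[intent]):.2f}")

A score crossing a high-confidence threshold would trigger the alarm directly; an intermediate score would issue the identification challenge.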
Refining raw disparity maps from different algorithms to exploit their complementary advantages remains challenging. Uncertainty estimation and complex disparity relationships among pixels limit the accuracy and robustness of existing methods, and there is no standard method for fusing different kinds of depth data. In this paper, we introduce a new method to fuse disparity maps from different sources, while incorporating supplementary information (intensity, gradient, etc.) into a refiner network to better refine the raw disparity inputs. A discriminator network classifies disparities at different receptive fields and scales. Assuming the refined disparity map forms a Markov Random Field yields better estimates of the true disparity distribution. Both fully supervised and semi-supervised versions of the algorithm are proposed. The approach includes a more robust loss function to inpaint invalid disparity values and requires much less labelled data to train in the semi-supervised mode. The algorithm generalizes to fusing depth data from different kinds of sensors.
SDF-MAN: Semi-supervised Disparity Fusion with Multi-scale Adversarial Networks
Remote Sensing, February 2019. Video. Code.
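The adversarial refiner and discriminator are beyond a short sketch; the toy baseline below only illustrates the fusion contract those networks improve upon: inverse-variance weighting of two raw disparity maps with invalid pixels (marked 0) masked out. All names and values are illustrative.

    import numpy as np

    def fuse(d1, d2, var1, var2):
        # Inverse-variance confidence; invalid disparities contribute nothing.
        w1 = np.where(d1 > 0, 1.0 / var1, 0.0)
        w2 = np.where(d2 > 0, 1.0 / var2, 0.0)
        wsum = w1 + w2
        return np.where(wsum > 0, (w1 * d1 + w2 * d2) / np.maximum(wsum, 1e-9), 0.0)

    d_a = np.array([[10.0, 0.0], [12.0, 11.0]])   # 0 marks an invalid pixel
    d_b = np.array([[10.4, 10.8], [0.0, 11.2]])
    print(fuse(d_a, d_b, var1=0.5, var2=0.2))

Pixels where both inputs are invalid stay at 0 here; the learned refiner additionally inpaints such holes using the intensity and gradient guidance.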
Accurately registering point clouds from a cheap, low-resolution sensor is a challenging task. Existing rigid registration methods fail to use the physical 3D uncertainty distribution of each point from a real sensor in the dynamic alignment process, mainly because the uncertainty model for a point is static and invariant, and it is hard to describe how these physical uncertainty models change across different views. Additionally, the existing Gaussian mixture alignment architecture cannot efficiently implement these dynamic changes. We propose a simple architecture combining error estimation from sample covariances and dynamic global probability alignment using the convolution of uncertainty-based Gaussian Mixture Models (GMMs).
DUGMA: Dynamic Uncertainty-Based Gaussian Mixture Alignment
Proc. of 3D Vision, September 2018. Video. Code.
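A toy 2D version of the core idea: each point carries its own covariance, and the likelihood of a match uses the convolution of the two Gaussians, so the (rotated) covariances simply add. The real method optimizes the transform jointly; this sketch only scores candidate rotations exhaustively.

    import numpy as np

    def pair_ll(x, y, cov):
        # log-likelihood of point x under a Gaussian centred at y
        d = x - y
        return -0.5 * d @ np.linalg.inv(cov) @ d \
               - 0.5 * np.log(np.linalg.det(2 * np.pi * cov))

    def score(src, dst, cov_src, cov_dst, R):
        total = 0.0
        for i, p in enumerate(src):
            q = R @ p
            c_rot = R @ cov_src[i] @ R.T       # covariance rotates with its point
            j = np.argmin(np.linalg.norm(dst - q, axis=1))   # crude hard assignment
            total += pair_ll(q, dst[j], c_rot + cov_dst[j])  # convolved Gaussians
        return total

    rng = np.random.default_rng(1)
    src = rng.normal(size=(50, 2))
    rot = lambda t: np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    dst = src @ rot(0.3).T                     # ground-truth rotation of 0.3 rad
    covs = np.tile(np.eye(2) * 0.01, (50, 1, 1))
    best = max(np.linspace(0, 0.6, 61),
               key=lambda t: score(src, dst, covs, covs, rot(t)))
    print(f"recovered rotation: {best:.2f} rad")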
While pose estimation with visual SLAM can be highly accurate, it is not guaranteed to provide the smooth pose estimate that navigation algorithms expect, due to outliers in the measurements, noise in the pose estimate, and similar effects. For this reason, it has become common to include a filter that uses inertial sensors mounted on the vehicle and a motion model to constrain the estimated trajectory of the vehicle to be smooth. We propose an extension to the Extended Kalman Filter framework which can cope with irregularities in the SLAM measurements without access to their internal characteristics. The particular techniques implemented for this purpose are outlier rejection based on estimation of the measurement covariance from past measurements, penalization of lags, and a soft filter reset.
Covariance Estimation for Robust Visual-Inertial Odometry in Presence of Outliers and Lags
Technical Report, 2019. Video. Code.
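A minimal sketch of the gating idea on a 1D filter with a static motion model: the measurement covariance R is refreshed from a sliding window of accepted innovations (a crude proxy, since innovation variance actually estimates P + R), and a Mahalanobis gate rejects implausible measurements. All constants are illustrative.

    import numpy as np

    x, P, Q, R = 0.0, 1.0, 0.01, 1.0
    innovations = []
    for z in [0.1, 0.2, 0.15, 5.0, 0.25, 0.3]:   # 5.0 plays the SLAM outlier
        P += Q                                    # predict (static model)
        nu, S = z - x, P + R                      # innovation and its covariance
        if nu * nu / S < 9.0:                     # ~3-sigma Mahalanobis gate
            K = P / S
            x, P = x + K * nu, (1 - K) * P
            innovations.append(nu)
            if len(innovations) >= 3:             # refresh R from recent history
                R = max(np.var(innovations[-10:]), 1e-4)
        else:
            print(f"rejected outlier measurement {z}")
    print(f"final estimate: {x:.3f}")

The lag penalization and the soft reset of the filter state would hook into the same update loop but are omitted here.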
The advance of scene understanding methods based on machine learning relies on the availability of large ground truth datasets, which are essential for their training and evaluation. Constructing such datasets from real sensor imagery, however, typically requires substantial manual annotation of semantic regions in the data. To speed up this process, we propose a framework for semantic annotation of scenes captured by moving camera(s), e.g., mounted on a vehicle or robot. It uses an available 3D model of the traversed scene to project segmented 3D objects into each camera frame, yielding an initial annotation of the associated 2D image that is then manually refined by the user. The refined annotation can be transferred to the next consecutive frame using optical flow estimation.
Consistent Semantic Annotation of Outdoor Datasets via 2D/3D Label Transfer
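A minimal sketch of the initial-annotation step, assuming a labelled 3D point cloud and a calibrated pinhole camera; depth ordering, the optical-flow transfer and the manual refinement stage are not shown.

    import numpy as np

    def project_labels(points, labels, K, R, t, h, w):
        # Render class ids of labelled 3D points into a label image (0 = void).
        # Note: no z-buffering, so occlusion order between points is ignored.
        label_img = np.zeros((h, w), dtype=np.int32)
        cam = R @ points.T + t[:, None]           # world -> camera coordinates
        valid = cam[2] > 0                        # keep points in front of camera
        uvw = K @ cam[:, valid]
        u = np.round(uvw[0] / uvw[2]).astype(int)
        v = np.round(uvw[1] / uvw[2]).astype(int)
        ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        label_img[v[ok], u[ok]] = labels[valid][ok]
        return label_img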
We have developed a working pipeline which integrates hardware and software in an automated robotic rose cutter: to the best of our knowledge, the first robot able to prune rose bushes in a natural environment. Unlike similar approaches such as tree stem cutting, the proposed method does not require scanning the full plant, placing multiple cameras around the bush, or assuming that a stem does not move. It relies on a single stereo camera mounted on the end-effector of the robot and real-time visual servoing to navigate to the desired cutting location on the stem.
Real-time Stereo Visual Servoing for Rose Pruning with Robotic Arm
Proc. of ICRA, 2020. Video.
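A minimal sketch of one servoing iteration, assuming the stereo pipeline already provides the cutting point in the camera frame; the gain and the safety clamp are illustrative, not the actual controller.

    import numpy as np

    def servo_step(target_in_cam, gain=0.5, max_speed=0.05):
        # Proportional control: command a velocity toward the cutting point,
        # clamped so the arm moves at a safe speed near the plant.
        v = gain * target_in_cam
        speed = np.linalg.norm(v)
        return v * (max_speed / speed) if speed > max_speed else v

    print(servo_step(np.array([0.02, -0.01, 0.30])))  # target 30 cm ahead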
Visual servoing is a well-known task in robotics. However, challenges remain when multiple visual sources must be combined to guide the robot accurately, or when occlusions appear. We present a novel visual servoing approach using hybrid multi-camera input data to lead a robot arm accurately to dynamically moving target points in the presence of partial occlusions. The approach uses four RGBD sensors as Eye-to-Hand (EtoH) visual input and an arm-mounted stereo camera as Eye-in-Hand (EinH). A master supervisor task selects between the EtoH and the EinH input, depending on the distance between the robot and the target.
Hybrid Multi-camera Visual Servoing to Moving Target
Proc. of IROS, September 2018. Video.
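A sketch of the supervisor's selection rule and a naive fusion of the fixed-camera estimates; the threshold and the fusion are illustrative placeholders for the actual policy.

    import numpy as np

    def select_source(dist_to_target, einh_target_visible, near=0.4):
        # Close to the target and unoccluded: trust the in-hand stereo (EinH);
        # otherwise fall back to the four fixed RGBD sensors (EtoH).
        return "EinH" if einh_target_visible and dist_to_target < near else "EtoH"

    def fuse_etoh(estimates):
        # Average target estimates from the RGBD views, skipping occluded ones.
        pts = [p for p in estimates if p is not None]
        return np.mean(pts, axis=0)

    est = [np.array([0.5, 0.1, 0.9]), None, np.array([0.52, 0.12, 0.88]), None]
    print(select_source(0.6, True), fuse_etoh(est))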
The problem of finding the next best viewpoint for 3D modelling or scene mapping has been explored in computer vision over the last decade. We propose a method for dynamic next best viewpoint recovery of a target point while avoiding possible occlusions. Since the environment can change, the method has to iteratively find the next best view with a global understanding of the free and occupied parts of the scene. We model the problem as a set of possible viewpoints which correspond to the centres of the facets of a virtual tessellated hemisphere covering the scene. Taking into account occlusions, distances between current and future viewpoints, quality of the viewpoint and joint constraints (robot arm joint distances or limits), we evaluate and select the next best viewpoint.
Best Viewpoint Tracking for Camera Mounted on Robotic Arm with Dynamic Obstacles
Proc. of 3D Vision, October 2017. Presentation. Video.
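A sketch of viewpoint generation on the tessellated hemisphere and a weighted cost over the criteria listed above; the weights and terms are illustrative, and the occupancy map and kinematic checks are reduced to precomputed vectors.

    import numpy as np

    def hemisphere_viewpoints(radius, n_az=16, n_el=4):
        views = []
        for el in np.linspace(0.2, np.pi / 2, n_el):       # elevation rings
            for az in np.linspace(0, 2 * np.pi, n_az, endpoint=False):
                views.append(radius * np.array([np.cos(el) * np.cos(az),
                                                np.cos(el) * np.sin(az),
                                                np.sin(el)]))
        return np.array(views)

    def next_best_view(views, current, occluded, quality, joint_cost,
                       w=(1.0, 0.5, 10.0, 0.2)):
        travel = np.linalg.norm(views - current, axis=1)   # viewpoint distance
        cost = -w[0] * quality + w[1] * travel + w[2] * occluded + w[3] * joint_cost
        return np.argmin(cost)

    views = hemisphere_viewpoints(1.0)
    rng = np.random.default_rng(2)
    i = next_best_view(views, views[0], rng.random(len(views)) < 0.3,
                       rng.random(len(views)), rng.random(len(views)))
    print(f"next best view: facet {i} at {np.round(views[i], 2)}")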
My PhD thesis deals with the application of symmetry principles to computer vision problems of object detection in images. The focus is on how prior knowledge of translation, reflection and rotation symmetries can be encoded in probabilistic models. Conceptually, our object-centered approach lies between general symmetry detection and strongly informed procedural modelling.
In addition to the previous research on translation, we explore the recognition of reflection and rotation symmetries. Here, Bayesian inference is used to handle a hierarchical model extending from the low-level geometry of reflection symmetry to dihedral symmetry groups. Objectness and compactness priors are included to reduce ambiguity in the detection. The increased complexity of the model is compensated by an advanced inference method, which allows us to reason rigorously about the number of detected components by means of model selection. As a result, we show that this approach improves performance on standard datasets, particularly when multiple objects are present.
Probabilistic Models for Symmetric Object Detection in Images
Czech Technical University, November 2015. Presentation.
We propose a method for semantic parsing of images with regular structure. The structured objects are modelled in a densely connected CRF. The paper describes how to embody specific spatial relations in a representation called Spatial Pattern Templates (SPT), which allows us to capture regularity constraints of alignment and equal spacing in pairwise and ternary potentials.
Assuming the input image is pre-segmented into salient regions, the SPTs describe which segments can interact in the structured graphical model. The model parameters are learnt to describe the formal language of semantic labellings. Given an input image, a consistent labelling over its segments linked in the CRF is recognized as a word from this language.
The CRF framework allows us to apply efficient algorithms for both recognition and learning. We demonstrate the approach on the problem of facade image parsing and show results comparable with state-of-the-art methods.
Spatial Pattern Templates for Recognition of Objects with Regular Structure
Proc. of GCPR, September 2013. (oral) Presentation.
Dataset: CMP Facade Database
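An illustrative pairwise potential in the spirit of the alignment template: two segments of the same class pay a cost that grows with their vertical misalignment. The quadratic form and scale are placeholders for the learnt potentials.

    def alignment_potential(seg_a, seg_b, label_a, label_b, sigma=5.0):
        # Regularity applies within a class: e.g. windows in one row align.
        if label_a != label_b:
            return 0.0
        dy = seg_a["cy"] - seg_b["cy"]    # vertical centre offset in pixels
        return (dy / sigma) ** 2          # quadratic misalignment cost

    print(alignment_potential({"cy": 120.0}, {"cy": 123.0}, "window", "window"))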
We present a method for recognition of structured images and demonstrate it on the detection of windows in facade images. Given an ability to obtain local low-level data evidence on primitive elements of a structure (like a window in a facade image), we determine their most probable number, attribute values (location, size) and neighbourhood relations.
The embedded structure is weakly modelled by pairwise attribute constraints, which allow structure and attributes to mutually support each other. We use the very general framework of reversible jump MCMC, which allows simple implementation of a specific structure model and the plug-in of almost arbitrary element classifiers.
We have chosen the domain of window recognition in facade images to demonstrate that the result is an efficient algorithm achieving the performance of other, more strongly informed methods for regular structures.
Stochastic Recognition of Regular Structures in Facade Images
IPSJ Trans. Computer Vision and Applications, May 2012. Demo.
A Weak Structure Model for Regular Pattern Recognition Applied to Facade Images
Proc. of ACCV, November 2010. (oral) Video.
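A minimal birth/death reversible-jump sampler over the number of elements on a 1D "facade row", with a toy detector response standing in for real image evidence; the attribute moves and the pairwise structure constraints of the full model are omitted.

    import numpy as np

    rng = np.random.default_rng(3)
    D, lam = 10.0, 2.0                        # domain length, prior intensity
    peaks = np.array([2.0, 5.0, 8.0])         # toy window evidence locations

    def f(u):                                 # element likelihood at position u
        return 0.02 + np.exp(-np.min((u - peaks) ** 2) / 0.1)

    xs = []                                   # current configuration of elements
    for _ in range(20000):
        if rng.random() < 0.5:                # birth: propose a new element
            u = rng.uniform(0, D)
            if rng.random() < min(1.0, lam * f(u) * D / (len(xs) + 1)):
                xs.append(u)
        elif xs:                              # death: propose removing an element
            i = rng.integers(len(xs))
            if rng.random() < min(1.0, len(xs) / (lam * f(xs[i]) * D)):
                xs.pop(i)
    print(f"{len(xs)} elements near {sorted(round(x, 1) for x in xs)}")

The sampler settles on configurations whose count and positions match the evidence peaks, which is the model-selection behaviour described above.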
We propose a surface mesh refinement pipeline for accurate 3D reconstruction from multiple images that deals with some possible sources of inaccuracy present in the input data.
First, we address the problem of inaccurate camera calibration with a method that adjusts the camera parameters in a global structure-and-motion problem, solved with a depth map representation that is suitable for large scenes.
Second, we take the triangular mesh and calibration improved by the global method in the first phase and refine the surface both geometrically and radiometrically. Here we propose a surface energy which combines photo-consistency with contour matching, and minimize it with a gradient descent method.
Our main contribution lies in the effective computation of the gradient, which naturally balances the weight between the regularizing and data terms by employing a scale-space approach to find the correct local minimum.
The results are demonstrated on standard high-resolution datasets and a complex outdoor scene.
Refinement of Surface Mesh for Accurate Multi-View Reconstruction
International Journal of Virtual Reality, March 2010. (extended version, pre-print)
Presented at ACCV Modeling-3D workshop, September 2009. Supplemental video. Presentation.
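A toy illustration of the scale-space descent on a 1D height profile instead of a surface mesh: the gradient is smoothed with a shrinking Gaussian so the coarse shape is corrected before fine detail. The data term here is a plain L2 difference, standing in for the photo-consistency and contour terms.

    import numpy as np

    def gaussian_smooth(g, sigma):
        if sigma == 0:
            return g
        r = int(3 * sigma)
        k = np.exp(-np.arange(-r, r + 1) ** 2 / (2.0 * sigma ** 2))
        return np.convolve(g, k / k.sum(), mode="same")

    target = np.sin(np.linspace(0.0, 2.0 * np.pi, 200))   # stand-in for data term
    h = np.zeros(200)                                     # initial heights
    for sigma in (20, 10, 5, 2, 0):                       # coarse-to-fine schedule
        for _ in range(200):
            grad = h - target                             # data gradient (toy L2)
            grad -= 0.5 * (np.roll(h, 1) + np.roll(h, -1) - 2 * h)  # smoothness
            h -= 0.1 * gaussian_smooth(grad, sigma)       # smoothed descent step
    print(f"mean residual: {np.abs(h - target).mean():.4f}")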
We present a novel depth map fusion algorithm for image-based surface reconstruction from a set of calibrated images. The problem is formulated in a Bayesian framework, where estimates of depth and visibility in a set of selected cameras are iteratively improved.
The core of the algorithm is the minimization of overall geometric L2 error between measured 3D points and the depth estimates. In the visibility estimation task, the algorithm aims at outlier detection and noise suppression, as both types of errors are often present in the stereo output. The geometrical formulation allows for simultaneous refinement of the external camera parameters, which is an essential step for obtaining accurate results even when the calibration is not precisely known.
We show that the results obtained with our method are comparable to other state-of-the-art techniques.
Depth Map Fusion with Camera Position Refinement
Proc. of CVWW, February 2009. Presentation.
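A sketch of the robust per-pixel fusion idea: depth hypotheses from several stereo pairs are combined by iteratively re-weighted least squares, so outliers receive a low visibility-style weight while noise is averaged out. The true method minimizes 3D geometric error and also refines the cameras, which is not reproduced here.

    import numpy as np

    def fuse_depth(hyps, iters=10, scale=0.05):
        d = np.median(hyps)                   # robust initialisation
        for _ in range(iters):
            w = 1.0 / (1.0 + ((hyps - d) / scale) ** 2)   # soft outlier weights
            d = np.sum(w * hyps) / np.sum(w)  # weighted L2 estimate
        return d, w

    d, w = fuse_depth(np.array([1.02, 0.98, 1.00, 2.50]))  # 2.50 is an outlier
    print(f"fused depth {d:.3f}, weights {np.round(w, 3)}")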
Representation of Geometric Objects for 3D Photography
Master's thesis, CTU Prague, January 2008. Presentation (Czech).
Awarded Dean's prize for outstanding Master thesis.