Design of the proposed approach

The main goal of the proposed work is to develop a training technique for social robots based on learning from their own errors, i.e. exploiting the information gathered by the robot's sensors in unexpected situations in order to find out how to act in a similar future context. The learning phase is supervised by a human operator, who notifies the machine when it executes a wrong step and provides a decider that recognizes the current conditions. With the help of machine learning procedures, it is possible to improve the behavior of an autonomous robot and to generalize over different but similar circumstances.

In order to catch opportunities and activate recovery actions, the cameras play a fundamental role. An image acquired by the robot in a given situation is fed to two modules: the first extracts distance measurements of the surroundings and localizes obstacles within a portion of the image, while the second applies machine learning algorithms to identify obstacles, people or events. The depth module provides useful information to the detection module, improving the quality of the computation. The pre-processing phase consists of data augmentation, normalization and vectorization of the input images, and it is essential for highlighting the proper features before submission to the decision structure.

Detection Module

A public environment, which covers almost all the places where we spend time every day, is a very insidious space for an autonomous robot. To be defined as such, a robot has to move properly around people, doors, aisles and random objects that appear without warning. A robot endowed with the ability to discern particular objects during its tasks can behave very smartly. There are in fact situations that a robot equipped only with laser scan or ultrasound sensors cannot understand. Its navigation plan may be perfect and extremely well planned, but without a camera and the decisional power of the Support Vector Machine models the agent will never know whether one of the obstacles on its path is a human being, whether the floor of a room is completely covered with newspapers but it is still possible to proceed, or whether a red light means that the way is closed.

Obstacles do not all belong to the same category. A social robot must have the power to discern unmanned objects from real people, closed doors from walls, dangerous situations from sustainable ones, and so on. For these purposes we mostly need RGB images, which go through a pre-processing phase that makes the features of interest easily extractable. Keeping in mind the fundamental requirements of the system, explained in the page Panoramic, I focused on developing an incremental learning approach that performs well even with a limited availability of training instances. The key lies in reusing true positive instances and in acquiring false positives to further train the decider system.

The Detection Module takes as input preprocessed depth and, mostly, RGB images. The preprocessing operations are explained in detail in the next section; here we focus on what happens before and after that phase.

In order to spot all kinds of objects in an image shot by the robot camera, we have to analyze it portion by portion. The Sliding Window approach consists of creating sub-images that span the whole surface of the original picture: a rectangular window, defined by its pixel coordinates, is slid across the image, and every sub-image it produces is preprocessed and sent to the SVM models. With a cascade architecture, the major part of the true negatives is discarded in the early levels.
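A minimal sketch of such a window generator, assuming NumPy-array images as provided by OpenCV; the window size and stride values are illustrative:

    def sliding_windows(image, win_w=64, win_h=128, stride=16):
        # Yield (x, y, window) for every window position inside the image.
        h, w = image.shape[:2]
        for y in range(0, h - win_h + 1, stride):
            for x in range(0, w - win_w + 1, stride):
                yield x, y, image[y:y + win_h, x:x + win_w]

In practice the scan is usually repeated on rescaled copies of the image (an image pyramid), so that objects of different apparent sizes fit the fixed window.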

The core of this project is a set of Support Vector Machines (SVMs), a supervised binary learning algorithm. They are very adaptable and, given a customizable kernel, allow the user to implement separation functions other than the linear one. Each SVM model is trained to recognize a particular situation, such as the presence of a door handle, indicating that the related door is closed, or of a person, spotted through their legs. The models receive as input two different kinds of preprocessed images:

  • HOG-descriptors;
  • vectors of principal components.

HOG stands for Histograms of Oriented Gradients; these descriptors derive mathematically from the changes in brightness among the pixels. When such a change is large, it probably means we are in a region of the image where a subject ends and leaves space to the background or to another subject, i.e. we have a border. A vector of principal components, instead, derives from the application of the Principal Component Analysis (PCA) algorithm, which finds the most significant bases of the data structure given as input.
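A minimal sketch of the HOG extraction, assuming the hog function of scikit-image applied to a grayscale window; all parameter values are illustrative:

    from skimage.feature import hog

    def hog_descriptor(gray_window):
        # Histograms of gradient orientations, computed per cell and
        # normalized over overlapping blocks of cells.
        return hog(gray_window,
                   orientations=9,
                   pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2),
                   block_norm='L2-Hys')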

In order to achieve more reliability and precision, the SVM models are organized in a cascade fashion, where every level, except the first, is trained on a dataset composed of the true and false positives of the previous layer. The cascade procedure is described in more depth in the next section.

Preprocessing

In order to fulfill the objectives expressed in Panoramic, it is important to capitalize on every sample and to design a quick learning system. Data augmentation answers the need arising from a small collection of training images: from one shot we can obtain a set of different training instances by flipping the original image, slightly rotating it and changing its resolution. Even if we did not acquire all the poses of the subject under exam, we compensate by creating, in a few simple steps, many useful pictures that make up for the absence of a larger dataset.
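A minimal sketch of such an augmentation step, assuming OpenCV images; the rotation angles and scale factors are illustrative choices:

    import cv2

    def augment(image):
        # One shot becomes several training instances: the original,
        # its horizontal flip, slight rotations and resolution changes.
        variants = [image, cv2.flip(image, 1)]
        h, w = image.shape[:2]
        for angle in (-5, 5):
            M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
            variants.append(cv2.warpAffine(image, M, (w, h)))
        for scale in (0.8, 1.2):
            variants.append(cv2.resize(image, None, fx=scale, fy=scale))
        return variants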

Normalization is a delicate procedure that can help during the recognition task or ruin all the efforts if done wrong. Normally, graphical inputs go through a gamma correction, but the tuning of the gamma parameter is still an open problem. For this reason, the implemented solution mainly relies on square root normalization, which compensates for strong lights and colors and whose affinity with HOG-descriptor based classifiers has been empirically proven.
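Square root normalization amounts to a gamma correction with gamma fixed at 0.5; a small sketch on an 8-bit image:

    import numpy as np

    def sqrt_normalize(image):
        # Rescale to [0, 1] and take the square root, which compresses
        # strong lights and colors before gradient computation.
        return np.sqrt(image.astype(np.float32) / 255.0)

Note that the hog function of scikit-image exposes the same operation through its transform_sqrt parameter.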

Images are transformed into vectors using well-known procedures such as HOG-descriptor extraction and PCA reduction, sketched below.
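For the PCA part, a minimal sketch with scikit-learn; the number of components and the variable names are illustrative:

    import numpy as np
    from sklearn.decomposition import PCA

    def fit_pca(train_vectors, n_components=64):
        # Learn the most significant bases from the training vectors
        # (n_components must not exceed the number of samples or features).
        pca = PCA(n_components=n_components)
        pca.fit(np.asarray(train_vectors))
        return pca

    # pca.transform([x])[0] then yields the vector of principal
    # components for a new, previously unseen sample x.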

Incremental learning is accomplished thanks to a cascade of classifiers. Training one Support Vector Machine model with all the available information results in a machine that is not able to generalize over different circumstances: a single SVM is not as powerful and precise as a neural network with tens of hidden layers. In this respect, the idea of a cascade system is similar to the working principle of a neural network: every level is specialized in extracting a specific feature, i.e. it filters out a part of the false positives obtained during the agent's task. For the sake of clarity, false positives are instances incorrectly labeled as positive, i.e. with the number 1, by the machine. In the image below it is possible to see the learning phase of the serial structure: initially, all the samples, positive and negative, come from the environment. In the successive layers the positive set stays the same, so we do not have to take new shots of the situation we want to recognize, while the negative set is replaced with a set of false positives. Training on false positives ensures a low percentage of mistakes, because the machine learns from its own errors.
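A minimal sketch of this training loop, assuming scikit-learn SVMs; the number of levels, the kernel choice and the variable names are illustrative, and environment stands for the negative candidate windows gathered during the agent's task:

    import numpy as np
    from sklearn.svm import SVC

    def train_cascade(positives, negatives, environment, n_levels=3):
        models = []
        neg = list(negatives)                  # initial negatives from the environment
        for _ in range(n_levels):
            X = np.vstack([positives, neg])
            y = np.array([1] * len(positives) + [0] * len(neg))
            svm = SVC(kernel='rbf')            # customizable kernel
            svm.fit(X, y)
            models.append(svm)
            # The positive set stays the same; the negative set becomes
            # the false positives of the level just trained.
            neg = [x for x in environment if svm.predict([x])[0] == 1]
            if not neg:                        # no mistakes left to learn from
                break
        return models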

Thanks to this system, a binary Support Vector Machine can express its true separation power and reject, step by step, all the images we are not interested in. Computation times are contained and allow the robot to be ready within a few minutes from the start of a training phase. If an input image makes its way to the final classifier and is accepted, the detection algorithm labels it as positive, because it depicts what we are looking for.
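Continuing the sketch above, inference walks the cascade and stops at the first rejection, so most true negatives never reach the last levels:

    def cascade_predict(models, x):
        for svm in models:
            if svm.predict([x])[0] == 0:
                return 0                       # rejected early
        return 1                               # accepted by the final classifier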

Depth Module

The Depth Module mainly executes two tasks:

  1. from the input depth images it is possible to extract very accurate measurements of the surrounding world, thanks to the distance values encoded in the pixels of a depth image;
  2. thanks to an algorithm first presented by Y. Zhu, B. Yi, and T. Guo in "Simple outdoor environment obstacle detection method based on information fusion of depth and infrared", it is possible to extrapolate, with high confidence, the position of obstacles close to the robot agent: two binary images, obtained as dilated versions of two edge maps (one computed from the depth image, the other from the infrared image), are intersected, and the resulting regions mark the nearby obstacles (see the sketch after this list).
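A minimal sketch of the second task with OpenCV, assuming 8-bit depth and infrared images; the Canny thresholds and the dilation kernel size are illustrative:

    import cv2
    import numpy as np

    def obstacle_boxes(depth_8u, infrared_8u):
        kernel = np.ones((5, 5), np.uint8)
        # Edge maps of the two modalities, dilated to tolerate misalignment.
        edges_d = cv2.dilate(cv2.Canny(depth_8u, 50, 150), kernel)
        edges_i = cv2.dilate(cv2.Canny(infrared_8u, 50, 150), kernel)
        # Intersection of the two binary images: only edges confirmed by
        # both modalities survive, marking nearby obstacles.
        mask = cv2.bitwise_and(edges_d, edges_i)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        return [cv2.boundingRect(c) for c in contours]  # (x, y, w, h) boxes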

The first task can be useful in situations where it is not possible to completely trust the sensor measurements of a robot, such as laser or ultrasound scans, because depth images yield an extremely good estimate of the dimensions of the environment. For example, in the following videos, a standard ROS navigation technique, move_base (first video), is compared against a camera-based navigation (second video).

gap-cross_move-base_cut.mp4
gap-cross_vision_cut.mp4
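Returning to the first task, a small sketch of the distance extraction from a depth image region; the millimeter encoding (typical of Kinect-like sensors) and the median statistic are assumptions:

    import numpy as np

    def region_distance_m(depth_image, x, y, w, h):
        region = depth_image[y:y + h, x:x + w].astype(np.float32)
        valid = region[region > 0]             # zero pixels carry no measurement
        if valid.size == 0:
            return float('nan')
        return float(np.median(valid)) / 1000.0   # millimeters to meters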

The second task aims to improve the computation performed by the Detection Module. As we saw in the previous section, the Detection Module scans the whole image with the Sliding Window approach in order to detect what it is looking for. If we know a priori the position of the objects to identify, the detection obviously becomes faster and more reliable. Below, the result of an example situation is shown.

A pair of close obstacles

Dilated edge map of the depth image

Dilated edge map of the infrared image

Intersection

Obstacle position extraction