This work proposes a person identification and tracking method based on Yolact and SIFT (Scale-Invariant Feature Transform) for an agricultural person-following robot using a single monocular camera. First, the Yolact algorithm is used to detect all the people in the image frame; their semantic segmentation is then used to separate each person from the background. Second, SIFT is used to extract features from all the detected people. Finally, the extracted features are compared with those obtained in the previous frame or with reference images and matched by approximate nearest-neighbor search.

Introduction

Robots are playing an important role in changing the way farms operate, decreasing costs and making up for the manpower shortage. From simple robots that only classify apples to complex autonomous mobile systems that navigate and harvest fruits by themselves, most of them use computer vision and deep learning for classification and navigation purposes. However, real-time object detection and classification is difficult and computationally expensive.

Our proposed method is based on Yolact and SIFT features, so it is resilient to scale changes, it does not depend on clothes colors, and it requires no training time; moreover, matching the features against those extracted in the previous frame decreases the error.

The main application of our proposed method is agricultural mobile robots that must follow a farmer.

Method

Our proposed method is divided into two main stages. The first is the “Searching stage”, where the robot looks for the target person (the master) for the first time, i.e., the algorithm has just been initialized or the robot has lost the target. This stage is triggered when the matching score obtained in the tracking stage, comparing the current frame with the previous one, is lower than a previously defined threshold. In the second stage, the “Tracking stage”, the algorithm compares the features of the current frame with those of the previous frame; between consecutive frames the changes in orientation or scale are small, unlike the changes with respect to the original reference images.
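Since both stages reduce to the same decision (does the best matching score clear the threshold?), the switching logic can be summarized in a short sketch. The helpers search_for_master and track_master, the state dictionary, and the value of SCORE_THRESHOLD are hypothetical placeholders for the steps described below, not code from the paper.

```python
# Minimal sketch of the two-stage control flow. search_for_master() and
# track_master() are hypothetical stand-ins for the Yolact/SIFT/FLANN
# steps described in the following subsections.

SCORE_THRESHOLD = 30  # assumed: minimum number of feature matches

def process_frame(frame, state):
    if state["prev_features"] is None:
        # Searching stage: no usable previous features, so match
        # against the stored reference images instead.
        master, score, features = search_for_master(frame, state["references"])
    else:
        # Tracking stage: match against the previous frame's features.
        master, score, features = track_master(frame, state["prev_features"])

    if score >= SCORE_THRESHOLD:
        state["prev_features"] = features   # keep tracking in the next frame
        return master
    state["prev_features"] = None           # target lost: re-enter searching
    return None
```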


Before the searching stage can run, reference images are collected: the master stands alone and multiple photos are taken from different angles. All the pictures are then processed individually with Yolact to obtain their semantic segmentation and remove the background. Finally, the pictures without background are processed with SIFT to extract and save the master's features.
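A minimal sketch of this extraction step follows, assuming a hypothetical yolact_person_masks(image) wrapper that runs Yolact and returns one binary uint8 mask per detected person (the network itself is not shown); the reference file names are also assumptions.

```python
import cv2

sift = cv2.SIFT_create()

def extract_reference_features(image):
    # Hypothetical Yolact wrapper; the master is alone, so take the
    # first (only) person mask.
    mask = yolact_person_masks(image)[0]
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Passing the mask to SIFT restricts keypoints to the segmented
    # person, which has the same effect as removing the background.
    keypoints, descriptors = sift.detectAndCompute(gray, mask)
    return keypoints, descriptors

# Assumed file names for reference photos taken from different angles.
reference_descriptors = [
    extract_reference_features(cv2.imread(path))[1]
    for path in ["master_front.jpg", "master_back.jpg", "master_side.jpg"]
]
```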

Searching stage

In this stage, there is no reliable information about who the master is, either because the algorithm has just been initialized, so there are no features from a previous frame to compare the current frame against, or because a low score in the tracking stage made it impossible to identify the master correctly.

First, the semantic segmentation of each person in the frame is extracted using Yolact; then the features of each person are extracted independently using SIFT, and each person's features are matched against each reference image using FLANN (Fast Library for Approximate Nearest Neighbors). The scores obtained against the individual reference images are added together. This process is repeated for each person, and the one with the highest total score above the threshold is identified as the master.
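A sketch of this matching step is shown below. Here people_descriptors holds the SIFT descriptors of each person segmented in the current frame and reference_descriptors those of the reference images; the FLANN parameters and the 0.7 ratio are common defaults, not values taken from the paper.

```python
import cv2

FLANN_INDEX_KDTREE = 1
flann = cv2.FlannBasedMatcher(dict(algorithm=FLANN_INDEX_KDTREE, trees=5),
                              dict(checks=50))

def match_score(desc_a, desc_b):
    # Count the nearest-neighbor matches that pass Lowe's ratio test.
    pairs = flann.knnMatch(desc_a, desc_b, k=2)
    return sum(1 for p in pairs
               if len(p) == 2 and p[0].distance < 0.7 * p[1].distance)

def find_master(people_descriptors, reference_descriptors, threshold):
    # Sum each person's scores over all reference images.
    totals = [sum(match_score(person, ref) for ref in reference_descriptors)
              for person in people_descriptors]
    if totals and max(totals) >= threshold:
        return totals.index(max(totals))  # index of the identified master
    return None                           # nobody cleared the threshold
```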


Tracking stage

This stage is similar to the previous one, but instead of matching the features against those extracted from the reference images, the newly extracted features are compared with the ones extracted in the previous frame. Since the master cannot change position or orientation instantaneously, it is better to compare the current features with the previous ones: the orientation and scale changes are less abrupt, and more matches can be obtained.
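A sketch of the tracking step, reusing the hypothetical match_score() helper from the searching-stage sketch: each person detected in the current frame is compared only against the descriptors saved from the previous frame.

```python
def track_master(people_descriptors, prev_descriptors, threshold):
    scores = [match_score(person, prev_descriptors)
              for person in people_descriptors]
    if scores and max(scores) >= threshold:
        best = scores.index(max(scores))
        # Save the winner's descriptors for the next frame's comparison.
        return best, people_descriptors[best]
    return None, None  # score too low: fall back to the searching stage
```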

Experiments

The experiments were conducted outdoors, simulating common scenarios in an agricultural field. The objective in all of them was to test the capacity of the algorithm to keep tracking the master and its capability to recover if the master is lost.

The four scenarios tested were the following:

Two people walking together in a straight line.

Occlusion by an object.

Master walking in the agricultural field avoiding other people.

Occlusion by a person walking between the camera and the master.

Results

The experimental results show that the overall success rate of the algorithm is above 90%, but the success rate decreases when the distance between the master and the camera is more than 2 meters, or when the experiments are conducted in the afternoon. In the first case it is harder to extract features because the images of the person are smaller; in the second case the light conditions differ from those of the reference images.

Conclusions

Our proposed method was able to identify and track a master person in an open field under different sunlight conditions, even though the people wore similar plain clothes without designs and the experiments were performed on different days.

In the future, a dynamic threshold estimator will be implemented to deal with the varying number of matches produced when the target person is far from or close to the robot. Furthermore, a target trajectory predictor will be developed to improve the performance of the algorithm when the master is occluded by other people. Finally, a local planner will be developed to generate the corresponding waypoints so that experiments with a real robot can be conducted.
