GFIE: A Dataset and Baseline for Gaze-Following from 2D to 3D in Indoor Environments

Paper & Supplementary material

Abstract

Gaze-following is a subtopic of gaze estimation that aims to automatically locate where a person in a scene is looking. It is an important cue for understanding human intention, such as identifying the objects or regions a person attends to. However, a survey of existing gaze-following datasets reveals defects in how their gaze point labels are collected: manual labeling may introduce subjective bias and is labor-intensive, while automatic labeling with an eye-tracking device alters the person's appearance. In this work, we introduce GFIE, a novel dataset recorded by a gaze data collection system we developed. The system is built from two devices, an Azure Kinect and a laser rangefinder, which projects a laser spot to steer the subject's attention while they act in front of the camera. We also develop an algorithm that locates the laser spot in each image to annotate the 2D/3D gaze target, after which the spot is removed so that the ground truth does not remain visible in the image. This procedure allows us to collect unbiased labels in unconstrained environments semi-automatically. We further propose a baseline method with stereo field-of-view (FoV) perception to establish a 2D/3D gaze-following benchmark on the GFIE dataset.

An overview of our research

Motivation

Figure 1

Most existing datasets are manually annotated, but annotator subjectivity can cause labels to deviate from the actual gaze target. This is demonstrated by the sample in Figure 1 a), where each annotator has a different opinion about the gaze target of the same person. The labor-intensive annotation process is another drawback. The eye-tracking device in Figure 1 b) captures annotations automatically but alters the subjects' appearance in the dataset, creating a gap between the recorded data and gaze-related behavior in natural environments.

To address these problems, as shown in Figure 1 c), we propose a novel system for building our GFIE dataset, which provides accurate annotations and clean training data recorded in natural environments. The system consists of a laser rangefinder and an RGB-D camera (Azure Kinect): we manipulate the laser rangefinder to guide the subject's gaze with its laser spot while recording their activities with the RGB-D camera. Our proposed algorithm detects the laser spot in each image, which localizes the person's 2D gaze target; combined with the distance to the spot measured by the laser rangefinder, the 3D gaze target can also be reconstructed. Since a visible laser spot would leak the ground truth into the image, we apply an image inpainting algorithm to remove it before assembling the final dataset. Most of these steps are automated, greatly reducing the required human effort.
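The 2D-to-3D step above is a standard pinhole back-projection: given the laser spot's pixel coordinates and its metric distance along the camera's depth axis, the 3D gaze target follows from the camera intrinsics. The sketch below is illustrative only; the intrinsic values and the example pixel/depth are hypothetical placeholders, not the calibration or measurements used in the paper.

```python
import numpy as np

# Hypothetical pinhole intrinsics for an RGB-D camera (placeholder values,
# not the calibration from the GFIE collection system).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

def backproject(u, v, depth_m, fx=FX, fy=FY, cx=CX, cy=CY):
    """Lift a pixel (u, v) with metric depth (meters) to a 3D point
    in the camera coordinate frame using the pinhole model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Example: laser spot detected at pixel (480, 300), depth of 2.5 m
# derived from the rangefinder reading.
gaze_target_3d = backproject(480, 300, 2.5)
```

For removing the visible spot afterwards, one plausible implementation is OpenCV's `cv2.inpaint` with a small circular mask around the detected spot (e.g., the `cv2.INPAINT_TELEA` method); the paper's actual inpainting algorithm may differ.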

Workflow for GFIE dataset generation