I have been a postdoc at the Idiap Research Institute, working with Dr. Jean-Marc Odobez, since January 2020. I obtained my PhD degree from EPFL, Switzerland. I received my Bachelor's and Master's degrees in 2011 and 2014, respectively, both from Xi'an Jiaotong University, China. From 2014 to 2015, I was an algorithm engineer at Alibaba, China.
My current research focuses on 3D gaze estimation. Previously, I also worked on 3D head pose estimation, 3D facial expression modelling, and head nod detection. My research is supported by the MuMMER and UBImpressed projects.
Gang Liu, Yu Yu, Kenneth Alberto Funes Mora and Jean-Marc Odobez
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), accepted 2019
Non-invasive gaze estimation methods usually regress gaze directions directly from a single face or eye image. However, due to important variabilities in eye shapes and inner eye structures amongst individuals, universal models obtain limited accuracies, and their outputs usually exhibit high variance as well as subject-dependent biases. Therefore, accuracy is usually increased through calibration, allowing gaze predictions for a subject to be mapped to his/her actual gaze. In this paper, we introduce a novel image differential method for gaze estimation. We propose to directly train a differential convolutional neural network to predict the gaze difference between two eye input images of the same subject. Then, given a set of subject-specific calibration images, we can use the inferred differences to predict the gaze direction of a novel eye sample. The assumption is that by allowing the comparison between two eye images, nuisance factors (alignment, eyelid closure, illumination perturbations) which usually plague single-image prediction methods can be much reduced, allowing better prediction altogether. Experiments on 3 public datasets validate our approach, which consistently outperforms state-of-the-art methods even when using only one calibration sample or when the latter methods are followed by subject-specific gaze adaptation.
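To illustrate the differential idea, here is a minimal sketch of the inference step: the gaze of a new eye image is obtained by adding the predicted gaze difference to each calibration sample and averaging. The function diff_net is a hypothetical stand-in for the trained differential network; names and shapes are illustrative assumptions, not the paper's actual interface.

    import numpy as np

    def predict_gaze(eye_img, calib_imgs, calib_gazes, diff_net):
        """Predict (yaw, pitch) gaze for `eye_img` from a few calibration samples.

        calib_imgs  : calibration eye images of the same subject
        calib_gazes : known (yaw, pitch) gaze angles for those images
        diff_net    : callable returning the gaze difference between two images
        """
        estimates = []
        for img_i, gaze_i in zip(calib_imgs, calib_gazes):
            # gaze(eye_img) ~ gaze(img_i) + predicted difference (img_i -> eye_img)
            estimates.append(np.asarray(gaze_i) + diff_net(img_i, eye_img))
        # Average the per-calibration-sample estimates
        return np.mean(estimates, axis=0)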
Yu Yu, Jean-Marc Odobez
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020
PDF (Camera-ready coming soon)
Although automatic gaze estimation is very important to a large variety of application areas, it is difficult to train accurate and robust gaze models, in great part due to the difficulty of collecting large and diverse data (annotating 3D gaze is expensive and existing datasets use different setups). To address this issue, our main contribution in this paper is an effective approach to learn a low-dimensional gaze representation without gaze annotations, which, to the best of our knowledge, is the first work to do so. The main idea is to rely on a gaze redirection network and to use the difference between the gaze representations of the input and target images (of the redirection network) as the redirection variable. A redirection loss in the image domain allows the joint training of both the redirection network and the gaze representation network. In addition, we propose a warping field regularization which not only provides an explicit physical meaning to the gaze representations but also avoids redirection distortions. Promising results on few-shot gaze estimation (competitive results can be achieved with ≤ 100 calibration samples), cross-dataset gaze estimation, gaze network pretraining, and another task (head pose estimation) demonstrate the validity of our framework.
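A minimal sketch of one training step of this self-supervised setup is given below, assuming an encoder E that produces the low-dimensional gaze code and a redirection network R that predicts a dense displacement (warping) field. Module names, tensor shapes, and the simple L2 field penalty standing in for the paper's warping field regularizer are all illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def train_step(E, R, img_in, img_tgt, optimizer, lam=0.1):
        # Redirection variable: difference of the unsupervised gaze codes
        delta = E(img_tgt) - E(img_in)                          # (N, d)

        # Warp the input image towards the target with a predicted flow field
        n = img_in.shape[0]
        theta = torch.eye(2, 3, device=img_in.device).unsqueeze(0).expand(n, -1, -1)
        base_grid = F.affine_grid(theta, img_in.shape, align_corners=False)  # identity grid
        disp = R(img_in, delta)                                 # (N, H, W, 2) displacement
        img_redir = F.grid_sample(img_in, base_grid + disp, align_corners=False)

        # Image-domain redirection loss plus a penalty on the warping field
        loss = F.l1_loss(img_redir, img_tgt) + lam * disp.pow(2).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()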
Yu Yu, Gang Liu, Jean-Marc Odobez
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019
As an indicator of human attention, gaze is a subtle behavioral cue which can be exploited in many applications. However, inferring 3D gaze direction is challenging even for deep neural networks, given the lack of large amounts of data (groundtruthing gaze is expensive and existing datasets use different setups) and the inherent presence of gaze biases due to person-specific differences. In this work, we address the problem of person-specific gaze model adaptation from only a few reference training samples. The main and novel idea is to improve gaze adaptation by generating additional training samples through the synthesis of gaze-redirected eye images from existing reference samples. In doing so, our contributions are threefold: (i) we design our gaze redirection framework from synthetic data, allowing us to benefit from aligned training sample pairs to predict accurate inverse mapping fields; (ii) we propose a self-supervised approach for domain adaptation; (iii) we exploit the gaze redirection to improve the performance of person-specific gaze estimation. Extensive experiments on two public datasets demonstrate the validity of our gaze retargeting and gaze estimation framework.
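The augmentation idea can be sketched as follows: each of the few reference samples is redirected to nearby gaze directions to enlarge the person-specific training set before adaptation. The redirect function, the offset range, and the variable names are hypothetical placeholders for illustration only.

    import numpy as np

    def augment_calibration(ref_imgs, ref_gazes, redirect, n_aug=20, max_delta=0.3):
        """Generate extra (image, gaze) pairs by redirecting the few reference samples."""
        aug_imgs, aug_gazes = [], []
        rng = np.random.default_rng(0)
        for img, gaze in zip(ref_imgs, ref_gazes):
            for _ in range(n_aug):
                delta = rng.uniform(-max_delta, max_delta, size=2)  # yaw/pitch offset (radians)
                aug_imgs.append(redirect(img, delta))               # synthesized redirected eye image
                aug_gazes.append(np.asarray(gaze) + delta)          # corresponding gaze label
        return aug_imgs, aug_gazes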
Remy Siegfried, Yu Yu and Jean-Marc Odobez
ACM Symposium on Eye Tracking Research & Applications (ETRA) 2019
Recognizing eye movements is important for gaze behavior understanding like in human communication analysis (human-human or robot interactions) or for diagnosis (medical, reading impairments). In this paper, we address this task using remote RGB-D sensors to analyze people behaving in natural conditions. This is very challenging given that such sensors have a normal sampling rate of 30 Hz and provide low-resolution eye images (typically 36x60 pixels), and natural scenarios introduce many variabilities in illumination, shadows, head pose, and dynamics. Hence gaze signals one can extract in these conditions have lower precision compared to dedicated IR eye trackers, rendering previous methods less appropriate for the task. To tackle these challenges, we propose a deep learning method that directly processes the eye image video streams to classify them into fixation, saccade, and blink classes, and allows to distinguish irrelevant noise (illumination, low-resolution artifact, inaccurate eye alignment, difficult eye shapes) from true eye motion signals. Experiments on natural 4-party interactions demonstrate the benefit of our approach compared to previous methods, including deep learning models applied to gaze outputs.
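As a rough illustration of classifying eye-image streams directly, the sketch below combines a per-frame CNN with a temporal recurrent layer over 36x60 grayscale crops and three classes (fixation, saccade, blink). The architecture is an illustrative stand-in, not the paper's exact network.

    import torch
    import torch.nn as nn

    class EyeMovementNet(nn.Module):
        def __init__(self, n_classes=3, hidden=64):
            super().__init__()
            # Per-frame appearance features from the low-resolution eye crop
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            )
            # Temporal model over the 30 Hz frame sequence
            self.gru = nn.GRU(32, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, frames):                               # frames: (N, T, 1, 36, 60)
            n, t = frames.shape[:2]
            feats = self.cnn(frames.flatten(0, 1)).flatten(1)    # (N*T, 32)
            out, _ = self.gru(feats.view(n, t, -1))
            return self.head(out)                                # per-frame class logits (N, T, 3)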
Gang Liu, Yu Yu, Kenneth Alberto Funes Mora and Jean-Marc Odobez
British Machine Vision Conference (BMVC) 2018
Gaze estimation methods usually regress gaze directions directly from a single face or eye image. However, due to important variabilities in eye shapes and inner eye structures amongst individuals, universal models obtain limited accuracies, and their outputs usually exhibit high variance as well as subject-dependent biases. Therefore, accuracy is usually increased through calibration, allowing gaze predictions for a subject to be mapped to his/her specific gaze. In this paper, we introduce a novel image differential method for gaze estimation. We propose to directly train a convolutional neural network to predict the gaze difference between two eye input images of the same subject. Then, given a set of subject-specific calibration images, we can use the inferred differences to predict the gaze direction of a novel eye sample. The assumption is that by allowing the comparison between two eye images, nuisance factors (alignment, eyelid closure, illumination perturbations) which usually plague single-image prediction methods can be much reduced, allowing better prediction altogether. Experiments on 3 public datasets validate our approach, which consistently outperforms state-of-the-art methods even when the latter are followed by subject-specific gaze adaptation.
Yu Yu, Gang Liu and Jean-Marc Odobez
European Conference on Computer Vision Workshop (ECCVW) 2018
As an indicator of attention, gaze is an important cue for human behavior and social interaction analysis. Recent deep learning methods for gaze estimation rely on plain regression of the gaze from images without accounting for potential mismatches in eye image cropping and normalization. This may impact the estimation of the implicit relation between visual cues and the gaze direction when dealing with low-resolution images or when training with a limited amount of data. In this paper, we propose a deep multitask framework for gaze estimation, with the following contributions. i) We propose a multitask framework which relies on both synthetic data and real data for end-to-end training. During training, each dataset provides the label of only one task, but the two tasks are combined in a constrained way. ii) We introduce a Constrained Landmark-Gaze Model (CLGM) modelling the joint variation of eye landmark locations (including the iris center) and gaze directions. By explicitly relating visual information (landmarks) to the more abstract gaze values, we demonstrate that the estimator is more accurate and easier to learn. iii) By decomposing our deep network into a network jointly inferring the parameters of the CLGM model and the scale and translation parameters of eye regions on one hand, and a CLGM-based decoder deterministically inferring landmark positions and gaze from these parameters and the head pose on the other hand, our framework decouples gaze estimation from irrelevant geometric variations in the eye image (scale, translation), resulting in a more robust model. Thorough experiments on public datasets demonstrate that our method achieves competitive results, improving over the state of the art in challenging free head pose gaze estimation tasks and in eye landmark localization (iris location).
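For intuition, the deterministic decoder side of such a constrained model can be sketched as a linear mapping from a small coefficient vector to landmarks and gaze, with crop scale/translation applied separately. This is a simplified illustration: the bases below are random placeholders (in the paper they are learned from data) and the head pose conditioning is omitted.

    import numpy as np

    rng = np.random.default_rng(0)
    K, L = 8, 17                               # number of coefficients, number of eye landmarks
    mean_lms   = rng.normal(size=(L, 2))       # mean landmark layout (normalized coords)
    lm_basis   = rng.normal(size=(K, L, 2))    # joint landmark variation basis
    gaze_basis = rng.normal(size=(K, 2))       # joint gaze (yaw, pitch) variation basis
    mean_gaze  = np.zeros(2)

    def clgm_decode(coeffs, scale, translation):
        """Deterministically decode landmarks and gaze from the shared coefficients."""
        lms = mean_lms + np.tensordot(coeffs, lm_basis, axes=1)   # (L, 2)
        lms = scale * lms + translation                           # undo crop geometry
        gaze = mean_gaze + gaze_basis.T @ coeffs                   # (2,)
        return lms, gaze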
Head pose estimation is a fundamental task for face and social related research. Although 3D morphable model (3DMM) based methods relying on depth information usually achieve accurate results, they usually require frontal or mid-profile poses, which precludes a large set of applications where such conditions cannot be guaranteed, like monitoring natural interactions from fixed sensors placed in the environment. A major reason is that 3DMM models usually only cover the face region. In this paper, we present a framework which combines the strengths of a 3DMM model fitted online with a prior-free reconstruction of a 3D full head model, providing support for pose estimation from any viewpoint. In addition, we also propose a symmetry regularizer for accurate 3DMM fitting under partial observations, and exploit visual tracking to address natural head dynamics with fast accelerations. Extensive experiments show that our method achieves state-of-the-art performance on the public BIWI dataset, as well as accurate and robust results on UbiPose, an annotated dataset of natural interactions that we make public and in which adverse poses, occlusions, or fast motions regularly occur.
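The core geometric step behind model-based pose tracking of this kind is a rigid alignment between the personalized head model and the current depth frame. Below is a minimal sketch of that step (the standard Kabsch/Procrustes solution), assuming point correspondences are already available, e.g. from an ICP loop; it is an illustration rather than the paper's implementation.

    import numpy as np

    def rigid_align(model_pts, frame_pts):
        """Return R, t minimizing || R @ model_pts.T + t - frame_pts.T ||^2 (Nx3 inputs)."""
        mu_m, mu_f = model_pts.mean(0), frame_pts.mean(0)
        H = (model_pts - mu_m).T @ (frame_pts - mu_f)              # 3x3 cross-covariance
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
        R = Vt.T @ D @ U.T
        t = mu_f - R @ mu_m
        return R, t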
Yu Yu, Kenneth Alberto Funes Mora and Jean-Marc Odobez
International Conference on Automatic Face and Gesture Recognition (FG) 2017
Accurate and robust 3D head pose estimation is important for face related analysis. Though high accuracy has been achieved by previous works based on 3D morphable models (3DMM), their performance drops with extreme head poses because such models usually only represent the frontal face region. In this paper, we present a robust head pose estimation framework by complementing a 3DMM model with an online 3D reconstruction of the full head, providing more support when handling extreme head poses. The approach includes a robust online 3DMM fitting step based on multi-view observation samples as well as smooth and face-neutral synthetic samples generated from the reconstructed 3D head model. Experiments show that our framework achieves state-of-the-art pose estimation accuracy on the BIWI dataset, and has robust performance for extreme head poses when tested on natural interaction sequences.
Remy Siegfried, Yu Yu and Jean-Marc Odobez
ACM International Conference on Multimodal Interaction (ICMI) 2017
Gaze is an important non-verbal cue involved in many facets of social interactions like communication, attentiveness, or attitudes. Nevertheless, extracting gaze directions visually and remotely usually suffers from large errors because of low-resolution images, inaccurate eye cropping, or large eye shape variations across the population, amongst others. This paper hypothesizes that these challenges can be addressed by exploiting multimodal social cues for gaze model adaptation on top of a head-pose independent 3D gaze estimation framework. First, a robust eye cropping refinement is achieved by combining a semantic face model with eye landmark detections, and we investigate whether temporal smoothing can overcome the limitations of instantaneous refinement. Second, to study whether social interaction conventions could be used as priors for adaptation, we exploit the speaking status and head pose constraints to derive soft gaze labels and infer person-specific gaze biases using robust statistics. Experimental results on gaze coding in natural interactions from two different settings demonstrate that the two steps of our gaze adaptation method contribute to reducing gaze errors by a large margin over the baseline and can be generalized to several identities in challenging scenarios.
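The bias-correction idea can be sketched as follows: frames where conversational cues (e.g. the subject plausibly looking at the current speaker, whose 3D position is known) provide a soft gaze label are selected, and a person-specific bias is taken as a robust (median) average of the prediction errors. The selection rule, variable names, and threshold are simplifying assumptions made for illustration.

    import numpy as np

    def estimate_gaze_bias(pred_gazes, soft_labels, confidences, min_conf=0.5):
        """pred_gazes, soft_labels: (N, 2) yaw/pitch arrays; confidences: (N,) soft-label weights."""
        keep = confidences >= min_conf
        residuals = soft_labels[keep] - pred_gazes[keep]
        return np.median(residuals, axis=0)      # robust person-specific bias

    def correct_gaze(pred_gaze, bias):
        return pred_gaze + bias                  # apply the bias to subsequent predictions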
Catharine Oertel, José Lopes, Yu Yu, Kenneth Alberto Funes Mora, Joakim Gustafson, Alan Black and Jean-Marc Odobez
ACM International Conference on Multimodal Interaction (ICMI) 2016
Current dialogue systems typically lack variation in audiovisual feedback tokens: either they do not encompass feedback tokens at all, or they only support a limited set of stereotypical functions. However, this does not mirror the subtleties of spontaneous conversations. If we want to be able to build an artificial listener, as a first step towards building an empathetic artificial agent, we also need to be able to synthesize more subtle audio-visual feedback tokens. In this study, we devised an array of monomodal and multimodal binary comparison perception tests and experiments to understand how different realisations of verbal and visual feedback tokens influence third-party perception of the degree of attentiveness. This allowed us to investigate i) which features (amplitude, frequency, duration, ...) of the visual feedback influence attentiveness perception; ii) whether visual or verbal backchannels are perceived to be more attentive; iii) whether the fusion of unimodal tokens with low perceived attentiveness increases the degree of perceived attentiveness compared to unimodal tokens with high perceived attentiveness taken alone; and iv) the automatic ranking of audio-visual feedback tokens in terms of the conveyed degree of attentiveness.
Yiqiang Chen, Yu Yu and Jean-Marc Odobez
International Conference on Computer Vision Workshop (ICCVW) 2015
As a means of non-verbal communication, head gestures play an important role in face-to-face conversation, and recognizing them is therefore of high value for social behavior analysis or Human-Robot Interaction (HRI) modelling. Among the various gestures, the head nod is the most common one and can convey agreement or emphasis. In this paper, we propose a novel nod detection approach based on a full 3D face centered rotation model. Compared to previous approaches, we make two contributions. Firstly, the head rotation dynamics are computed within the head coordinate system instead of the camera coordinate system, leading to pose-invariant gesture dynamics. Secondly, besides the rotation parameters, a feature related to the head rotation axis is proposed so that nod-like false positives due to body movements can be eliminated. The experiments on two-party and four-party conversations demonstrate the validity of the approach.
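A minimal sketch of the pose-invariant rotation feature is given below: the frame-to-frame head rotation is expressed in the head's own coordinate frame, converted to axis-angle form, and the rotation axis is compared to the head's left-right axis (nods rotate mainly about that axis). Axis conventions and thresholds are illustrative assumptions, not the paper's detector.

    import numpy as np

    def head_frame_rotation(R_prev, R_curr):
        """Relative rotation between frames, expressed in the head coordinate system."""
        return R_prev.T @ R_curr                 # camera-frame rotations -> head-frame increment

    def rotation_axis_feature(R_rel, eps=1e-8):
        """Return (angle, unit axis) of the axis-angle form of R_rel."""
        angle = np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))
        axis = np.array([R_rel[2, 1] - R_rel[1, 2],
                         R_rel[0, 2] - R_rel[2, 0],
                         R_rel[1, 0] - R_rel[0, 1]])
        axis = axis / (np.linalg.norm(axis) + eps)
        return angle, axis

    def looks_like_nod(R_prev, R_curr, lr_axis=np.array([1.0, 0.0, 0.0]), min_align=0.8):
        # A nod-like increment rotates noticeably, and about the left-right head axis
        angle, axis = rotation_axis_feature(head_frame_rotation(R_prev, R_curr))
        return angle > 0.02 and abs(axis @ lr_axis) > min_align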