Freehand ray pointing is a common input interaction in extended reality (XR). Due to noisy input recognition and imprecise hand movements, input is often slow and error-prone, especially when targets are small or located close to each other. We present a novel computational framework for predicting user intention during point-and-select tasks based on a grid representation of the environment and trajectory. Our machine learning approach uses gaze and hand movement data from target selection tasks to predict the grid cells within which the user intends to make selections. We trained an ensemble model that combines the outputs of unimodal models using either gaze or hand data with a multimodal model using both gaze and hand data. The grid representation-based ensemble model outperforms a raw trajectory-based baseline model, achieving 7% to 12.7% higher accuracy across different grid granularity levels. Further, 10-fold cross-validation showed that our ensemble model achieves average prediction accuracies of 85% and 92% at 2 s and 1 s prior to users' selections, respectively. Our novel approach provides highly accurate predictions during point-and-select tasks across users, which can be used to enable selection facilitation techniques, thus improving performance and user experience during freehand pointing.
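To make the grid representation and ensemble combination concrete, the sketch below (in Python) illustrates one way to discretize gaze/hand ray-hit positions into grid-cell indices and to fuse per-cell probability distributions from the two unimodal models and the multimodal model. The function names (`encode_to_grid`, `ensemble_predict`), the planar 8x8 grid, and the equal-weight averaging are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch: grid encoding of trajectories and ensemble fusion.
# Assumes the selection surface is discretized into an n x n grid and
# each model outputs a probability distribution over grid cells.
import numpy as np

def encode_to_grid(points_xy, n_cells=8, extent=1.0):
    """Map 2D gaze/hand ray-hit points to flattened grid-cell indices."""
    # Normalize points from [-extent, extent] to cell coordinates [0, n_cells).
    cells = np.clip(((points_xy + extent) / (2 * extent)) * n_cells,
                    0, n_cells - 1e-9)
    col, row = cells[:, 0].astype(int), cells[:, 1].astype(int)
    return row * n_cells + col  # one cell index per time step

def ensemble_predict(p_gaze, p_hand, p_multi, weights=(1/3, 1/3, 1/3)):
    """Combine per-cell probabilities from the three models by weighted averaging."""
    combined = weights[0] * p_gaze + weights[1] * p_hand + weights[2] * p_multi
    combined /= combined.sum()
    return combined, int(np.argmax(combined))

# Example with stand-in data: a short trajectory and random model outputs.
rng = np.random.default_rng(0)
trajectory = rng.uniform(-1.0, 1.0, size=(60, 2))   # 1 s at 60 Hz (assumed rate)
cell_sequence = encode_to_grid(trajectory)          # input representation per step
probs = [rng.dirichlet(np.ones(64)) for _ in range(3)]
dist, predicted_cell = ensemble_predict(*probs)
print(cell_sequence[:5], predicted_cell)
```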
Workflow of the target prediction system. (a) The time series of eye gaze (green solid line) and hand pointing (red solid line) are captured in real time. The dashed line segments represent where the user is going to look and point in the near future. (b) The time series are then encoded and fed into an LSTM neural network. The neural network calculates the probabilities of the grid cells being selected over different time spans. (c) The grid cell most likely to be selected is predicted before the user actually clicks. The object in that cell is highlighted in advance to reduce the user's effort.
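A minimal sketch of step (b), assuming the encoded gaze and hand features are concatenated per time step and an LSTM classifier outputs a distribution over grid cells. The class name `GridCellPredictor`, the layer sizes, the 6-dimensional per-step feature layout, and the 60 Hz sampling rate are illustrative assumptions rather than the system's exact architecture.

```python
# Sketch of the LSTM grid-cell classifier (PyTorch).
import torch
import torch.nn as nn

class GridCellPredictor(nn.Module):
    def __init__(self, n_features=6, hidden_size=64, n_cells=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_cells)

    def forward(self, x):
        # x: (batch, time, features), e.g. gaze xyz + hand-ray xyz per step.
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])  # logits over grid cells

model = GridCellPredictor()
window = torch.randn(1, 120, 6)            # e.g. ~2 s of samples at 60 Hz
probs = torch.softmax(model(window), dim=-1)
predicted_cell = probs.argmax(dim=-1).item()
print(predicted_cell)                      # cell to highlight in step (c)
```

In practice the predicted cell would drive the highlighting in step (c) only when its probability exceeds a confidence threshold, so that low-confidence predictions do not distract the user.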