Object Retrieval in VR - Siggraph Asia 2023

Commonsense Knowledge-Driven Joint Reasoning Approach for Object Retrieval in Virtual Reality

Haiyan JIANG 1,2, Dongdong WENG1,∗, Xiaonuo DONGYE1, Le LUO1, Zhenliang ZHANG2, ∗

1. Beijing Institute of Technology (BIT), China; 2. National Key Laboratory of General Artificial Intelligence, BIGAI, China; ∗Corresponding authors.

ACM Trans. Graph. 42, 6, Article 198 (December 2023), 18 pages. https://doi.org/10.1145/3618320

Retrieving out-of-reach objects is a crucial task in virtual reality (VR). One of the most commonly used approaches for this task is the gesture-based approach, which allows for bare-hand, eyes-free, and direct retrieval. However, previous work has primarily focused on assigned gesture design, neglecting the context. This can make it challenging to accurately retrieve an object from a large number of objects due to the one-to-one mapping metaphor, limitations of finger poses, and memory burdens. There is a consensus that objects and contexts are related, which suggests that the object expected to be retrieved is related to the context, including the scene and the objects with which users interact. As such, we propose a commonsense knowledge-driven joint reasoning approach for object retrieval, where human grasping gestures and context are modeled using an And- Or graph (AOG). This approach enables users to accurately retrieve objects from a large number of candidate objects by using natural grasping gestures based on their experience of grasping physical objects. Experimental results demonstrate that our proposed approach improves retrieval accuracy. We also propose an object retrieval system based on the proposed approach. Two user studies show that our system enables efficient object retrieval in virtual environments (VEs).

Introduction

How can we express our intention to retrieve objects in VR to interact with the environment?
Is context helpful and how?

Our contributions:

Propose a commonsense knowledge-based joint reasoning approach for object retrieval based on gesture.
Develop an object retrieval system based on the proposed joint reasoning approach.
Conduct two evaluation studies to demonstrate the usability of the system.

Our proposed approach allows users to retrieve objects by using grasping gestures as for the physical counterparts. The retrieval probability of each object for retrieval is jointly reasoned based on commonsense knowledge of the relationship between objects and contexts represented by an And-Or graph. ⊗ represents the multiplication of the probabilities of each part.

Examples: knife and skillet retrieval in different contexts with the same grasping gesture.

Background - Commonsense knowledge used for object retrieval reasoning.

Relationship between objects and scene background

Object occurrence: non-interactive object and interactive object

Object occurrence: non-interactive object and interactive object

Joint reasoning approach

SH-AOG to parse the context and human information

Left: The context and human information are parsed by and space-human and-or graph (SH-AOG) including an S-AOG and an H-AOG. A retrieval event is the selection of the parse graph pg with high probability from all pgs. Middle: Instantiating SH-AOG. Bottom in the middle: Legend of graphs. Right: The corresponding scene of the aforementioned SH-AOG.

Implementation

The part SH-AOG of an environment and the joint reasoning process for the kitchen knife retrieval. The red edges in Or-nodes indicate the branches of the parse graph 𝑝𝑔∗ with the highest joint probability for the terminal node of the kitchen knife.

Decision examples

Information used: H: the grasping gesture; S: the scene background; NI: the non-interactive objects; I: the interactive objects; P: human preference.

Ablation study results. The Top-1, Top-3 and Top-5 accuracy of retrieval in the ablation study. Preference is only taken into account in three tasks.

System

Workflow

First, a HandPose Estimator is used to predict the hand pose in real time, and a GraspType Estimator is proposed to predict the probabilities grasp types. Meanwhile, context information is captured. All information is used for joint reasoning. Finally, an ObjectPose Estimator is proposed to predict the candidate’s pose, and a HandPose Optimizer is proposed to improve the plausibility of the hand-object interaction.

The confusion matrix of gesture-type prediction results.

Object grasp prediction and optimization results

Hardware

An RGB camera is attached to the HMD to capture the hand. An HTC Vive Tracker is used to track the wrist pose. The motion of the virtual human is calculated from the tracked head and wrist poses using the inverse kinematics library - Final IK.

User can situate the virtual environment and retrieve expected virtual object to interact with the environment.

System evaluation

Evaluation 1: comparison with two traditional approaches

A user study was conducted, comparing our system with two traditional methods. Menu group, based on menu selection technique, allows users to retrieve candidates from a menu. Touch group, based on the grasping metaphor, allows users to select an object candidate by directly touching it as in the physical world. The results show that our approach has the same accuracy as the other two baselines when costs less time.