Current Research

Text-to-3D using Gaussian Splatting

Automatic text-to-3D generation that combines Score Distillation Sampling (SDS) with volume-rendering-based optimization has achieved remarkable progress in synthesizing realistic 3D objects. Yet most existing text-to-3D methods built on SDS and volume rendering suffer from inaccurate geometry, e.g., the Janus problem, since it is hard to explicitly integrate 3D priors into implicit 3D representations. Besides, they are usually time-consuming when generating elaborate 3D models with rich colors. In response, this paper proposes GSGEN, a novel method that adopts Gaussian Splatting, a recent state-of-the-art representation, for text-to-3D generation. GSGEN aims to generate high-quality 3D objects and address existing shortcomings by exploiting the explicit nature of Gaussian Splatting, which enables the incorporation of 3D priors. Specifically, our method adopts a progressive optimization strategy that includes a geometry optimization stage and an appearance refinement stage. In geometry optimization, a coarse representation is established under a 3D point-cloud diffusion prior along with the ordinary 2D SDS optimization, ensuring a sensible and 3D-consistent rough shape. Subsequently, the obtained Gaussians undergo an iterative appearance refinement to enrich texture details. In this stage, we increase the number of Gaussians via compactness-based densification to enhance continuity and improve fidelity. With these designs, our approach can generate 3D assets with delicate details and accurate geometry. Extensive evaluations demonstrate the effectiveness of our method, especially for capturing high-frequency components.
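
As a rough illustration of the geometry stage described above, the sketch below combines a 2D SDS gradient on a rendered view with a 3D SDS gradient on the Gaussian centers treated as a point cloud. The helpers `render_fn`, `sds_grad_2d`, and `sds_grad_3d` are hypothetical placeholders for a differentiable splatting renderer and the 2D/3D diffusion priors; this is a sketch, not the released GSGEN code.

```python
import torch

def geometry_step(gaussians, render_fn, sds_grad_2d, sds_grad_3d,
                  optimizer, lambda_3d=0.1):
    """One hypothetical geometry-stage update: a 2D SDS gradient from a rendered
    view plus a 3D SDS gradient on the Gaussian centers (treated as a point cloud)."""
    optimizer.zero_grad()

    # 2D prior: render a random view and backpropagate the SDS gradient.
    image = render_fn(gaussians)                 # (H, W, 3), differentiable render
    image.backward(gradient=sds_grad_2d(image), retain_graph=True)

    # 3D prior: apply the point-cloud diffusion gradient to the Gaussian centers.
    centers = gaussians.xyz                      # (N, 3), requires_grad=True
    centers.backward(gradient=lambda_3d * sds_grad_3d(centers))

    optimizer.step()
```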

Leveraging Large Language Model for Heterogeneous Ad Hoc Teamwork Collaboration

Compared with the widely investigated homogeneous multi-robot collaboration, heterogeneous robots with different capabilities can provide more efficient and flexible collaboration for more complex tasks. In this paper, we consider a more challenging heterogeneous ad hoc teamwork collaboration problem in which an ad hoc robot joins an existing heterogeneous team for a shared goal. Specifically, the ad hoc robot collaborates with unknown teammates without prior coordination, and it is expected to generate an appropriate cooperation policy to improve the efficiency of the whole team. To solve this challenging problem, we leverage the remarkable potential of the large language model (LLM) to establish a decentralized heterogeneous ad hoc teamwork collaboration framework that focuses on generating a reasonable policy for the ad hoc robot to collaborate with the original heterogeneous teammates. A training-free hierarchical dynamic planner is developed using the LLM together with the newly proposed Interactive Reflection of Thoughts (IRoT) method, allowing the ad hoc agent to adapt to different teams. We also build a benchmark dataset to evaluate the proposed framework on a heterogeneous ad hoc multi-agent tidying-up task. Extensive comparison and ablation experiments are conducted on the benchmark to demonstrate the effectiveness of the proposed framework. We have also deployed the proposed framework on physical robots in a real-world scenario.
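
Because the hierarchical dynamic planner is training-free, its core can be pictured as a prompt-and-reflect loop. The sketch below is a minimal, hypothetical rendering of an IRoT-style interaction, assuming a generic `query_llm(prompt) -> str` callable; the actual prompts, hierarchy, and reflection criteria in the paper may differ.

```python
from typing import Callable, List

def irot_plan(query_llm: Callable[[str], str], observation: str,
              teammate_actions: List[str], max_reflections: int = 2) -> str:
    """Hypothetical IRoT-style loop: propose a subtask, then ask the LLM to
    critique and revise it against the observed teammate behavior."""
    prompt = (f"Observation: {observation}\n"
              f"Teammate actions: {teammate_actions}\n"
              "Propose the next subtask for the ad hoc robot.")
    plan = query_llm(prompt)
    for _ in range(max_reflections):
        critique = query_llm(
            f"Plan: {plan}\nDoes this plan conflict with or duplicate the "
            "teammates' actions? Answer 'ok' or explain the issue.")
        if critique.strip().lower().startswith("ok"):
            break
        plan = query_llm(f"Revise the plan.\nPlan: {plan}\nIssue: {critique}")
    return plan
```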

Demonstrating HumanTHOR: A Simulation Platform and Benchmark for Human-Robot Collaboration in a Shared Workspace 

Human-robot collaboration (HRC) in a shared workspace has become a common pattern in real-world robot applications and has garnered significant research interest. However, most existing studies of human-in-the-loop (HITL) collaboration with robots in a shared workspace are evaluated either in simplified game environments or on physical platforms, which suffer from limited realism or limited scalability, respectively. To support future studies, we build an embodied framework named HumanTHOR, which enables humans to act in the simulation environment through VR devices and thus supports HITL collaboration in a shared workspace. To validate our system, we build a benchmark of everyday tasks and conduct a preliminary user study with two baseline algorithms. The results show that the robot can successfully assist humans in collaboration, which substantiates the significance of HRC. The comparison among different levels of baselines affirms that our system can adequately evaluate the capability of robots and serve as a benchmark for the calibration of robot algorithms, while indicating the remaining room for future study. In summary, our system provides a preliminary foundation for future research on HRC in the shared workspace.

CompetEvo: Towards Morphological Evolution from Competition 

Training an agent to adapt to specific tasks through co-optimization of morphology and control has attracted wide attention. However, whether an optimal configuration and set of tactics exist for agents in a multi-agent competition scenario remains a question that is challenging to conclude definitively. In this context, we propose competitive evolution (CompetEvo), which co-evolves agents' designs and tactics in confrontation. We build arenas consisting of three animals and their evolved derivatives, placing agents with different morphologies in direct competition with each other. The results reveal that our method enables agents to evolve designs and strategies better suited for fighting than those of fixed-morphology agents, allowing them to gain advantages in combat scenarios. Moreover, we demonstrate the striking behaviors that emerge when confrontations are conducted between asymmetric morphologies.
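
To make the co-evolution idea concrete, the sketch below alternates policy training with tournament-style selection and morphology mutation. The `evaluate_match`, `mutate_morph`, and `train_policy` callables are assumptions for illustration, not the CompetEvo implementation.

```python
import random

def co_evolve(population, evaluate_match, mutate_morph, train_policy,
              generations=10):
    """Hypothetical co-evolution loop: each generation, agents train their control
    policies, fight pairwise, and the winners' morphologies seed the next generation."""
    for _ in range(generations):
        for agent in population:
            train_policy(agent)                           # update tactics (control)
        random.shuffle(population)
        winners = []
        for a, b in zip(population[0::2], population[1::2]):
            winners.append(a if evaluate_match(a, b) > 0 else b)
        # winners survive; mutated copies of their designs replace the losers
        population = winners + [mutate_morph(w) for w in winners]
    return population
```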

Stimulate the Potential of Robots via Competition

It is common to feel pressure in a competitive environment, which arises from the desire to succeed relative to other individuals or opponents. Although we might become anxious under this pressure, it can also drive us to perform at our best in order to keep up with others. Inspired by this, we propose a competitive learning framework that helps an individual robot acquire knowledge from the competition, fully stimulating its dynamic potential in the race. Specifically, the competition information among competitors is introduced as an additional auxiliary signal for learning advantaged actions. We further build a Multiagent-Race environment, and extensive experiments are conducted, demonstrating that robots trained in competitive environments outperform those trained with SoTA algorithms in a single-robot environment.
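
One simple way to read "competition information as an additional auxiliary signal" is as a reward-shaping term based on the robot's lead over its opponents. The function below is a hypothetical illustration of that idea, not the exact signal used in the paper.

```python
from typing import List

def competitive_reward(base_reward: float, own_progress: float,
                       opponent_progress: List[float], beta: float = 0.5) -> float:
    """Hypothetical auxiliary competition signal: add a bonus proportional to the
    robot's lead (or deficit) relative to the best-performing opponent."""
    lead = own_progress - max(opponent_progress)
    return base_reward + beta * lead
```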

Vision-Language Foundation Models as Effective Robot Imitators 

Recent progress in vision-language foundation models has shown their ability to understand multimodal data and resolve complicated vision-language tasks, including robotic manipulation. We seek a straightforward way of making use of existing vision-language models (VLMs) with simple fine-tuning on robotics data. To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLM OpenFlamingo. Unlike prior works, RoboFlamingo utilizes pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is slightly fine-tuned by imitation learning only on language-conditioned manipulation datasets. Such a decomposition provides RoboFlamingo the flexibility for open-loop control and deployment on low-performance platforms. By exceeding state-of-the-art performance by a large margin on the tested benchmark, we show that RoboFlamingo can be an effective and competitive alternative for adapting VLMs to robot control. Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. RoboFlamingo can be trained or evaluated on a single GPU server, and we believe it has the potential to be a cost-effective and easy-to-use solution for robotic manipulation, empowering everyone with the ability to fine-tune their own robotics policy.
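
The explicit policy head can be pictured as a small recurrent network on top of per-step VLM features. The module below is a hedged sketch of that design (feature and action dimensions are illustrative assumptions), not the released RoboFlamingo head.

```python
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    """Hypothetical RoboFlamingo-style policy head: an LSTM models the sequential
    history of per-step VLM features and predicts continuous arm actions plus a
    gripper open/close probability."""
    def __init__(self, feat_dim=1024, hidden_dim=512, action_dim=7):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.arm = nn.Linear(hidden_dim, action_dim - 1)   # end-effector pose delta
        self.gripper = nn.Linear(hidden_dim, 1)            # open/close logit

    def forward(self, vlm_feats):                          # (B, T, feat_dim)
        h, _ = self.lstm(vlm_feats)
        return self.arm(h), torch.sigmoid(self.gripper(h))
```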

Extracting Dynamic Navigation Goal from Natural Language Dialogue

Effective access to relevant environmental changes in large human environments is critical for service robots to perform tasks. Since the position of a dynamic goal such as a human is variable, it is difficult for the robot to locate that person accurately. It is worth noting that humans obtain information through social software and use it to deal with daily affairs. Current robots search for targets without considering such implicit information changes, which often leads to failure to find the target objects. Therefore, we propose to extract implicit human location-change information from group-chat dialogues, i.e., watching dialogues in group chats and extracting who, when, and where (3W), to assist robots in finding specific human targets. We then propose a dynamic spatio-temporal map (DSTM) to store the change information as knowledge for the robot. When the robot is asked to find a target person, it follows the change information in the scene to infer the possible locations and probabilities of the target person, and then develops a search strategy. We deployed our framework on a custom mobile robot and performed instruction navigation tasks in a university building to evaluate our approach. We demonstrate the ability of our framework to collect and use information in a large human social environment.
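
A minimal data structure for the DSTM idea is sketched below: it stores the extracted (who, when, where) tuples and returns the most recent known place for a person as a search prior. The class and field names are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class LocationEvent:
    """One extracted (who, when, where) tuple from a group-chat message."""
    who: str
    when: float            # e.g., a Unix timestamp
    where: str             # semantic place label, e.g., "Room 305"

@dataclass
class DynamicSpatioTemporalMap:
    """Hypothetical DSTM: stores location-change events per person."""
    events: Dict[str, List[LocationEvent]] = field(default_factory=dict)

    def add(self, event: LocationEvent) -> None:
        self.events.setdefault(event.who, []).append(event)

    def latest_place(self, who: str) -> Optional[str]:
        records = self.events.get(who, [])
        return max(records, key=lambda e: e.when).where if records else None
```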

Lifelong Learning for Industrial Defect Classification

Automatic defect inspection is an important application for the development of smart factories in the era of Industry 4.0. It gathers data from production lines to train a model to automatically recognize certain types of defects. However, the defect types may vary during the production process, and it is difficult for the old model to adapt directly to new types of defects. Considering this problem, we propose an industrial defect classification framework based on lifelong learning, which continuously updates the defect classification model to adapt to different industrial scenarios as new defects appear. Specifically, a novel recursive gradient optimization (RGO) lifelong learning method is used to train the defect classification model, which only needs a fixed network capacity and does not need data replay. The proposed framework is evaluated on an experimental setup of six defect classification tasks. Extensive experiments in real scenarios are performed, demonstrating that the proposed framework can effectively relieve the catastrophic forgetting problem in lifelong learning compared with other state-of-the-art methods.

Knowledge-Based Embodied Question Answering 

We propose a novel Knowledge-based Embodied Question Answering (K-EQA) task, in which the agent intelligently explores the environment to answer various questions with knowledge. Unlike existing EQA work, which explicitly specifies the target object in the question, the agent can resort to external knowledge to understand more complicated questions such as “Please tell me what objects are used to cut food in the room?”, for which the agent must possess knowledge such as “a knife is used for cutting food”. To address this K-EQA problem, a novel framework based on neural program synthesis reasoning is proposed, where joint reasoning over the external knowledge and the 3D scene graph is performed to realize navigation and question answering. In particular, the 3D scene graph provides memory to store the visual information of visited scenes, which significantly improves efficiency for multi-turn question answering. Experimental results demonstrate that the proposed framework is capable of answering more complicated and realistic questions in the embodied environment. The proposed method is also applicable to multi-agent scenarios.

Mixed Neural Voxels for Fast Multi-view Video Synthesis 

Synthesizing high-fidelity videos from real-world multi-view input is challenging due to the complexity of real-world environments and highly dynamic movements. Previous works based on neural radiance fields have demonstrated high-quality reconstructions of dynamic scenes. However, training such models on real-world scenes is time-consuming, usually taking days or weeks. In this paper, we present a novel method named MixVoxels to efficiently represent dynamic scenes, enabling fast training and rendering. The proposed MixVoxels represents a 4D dynamic scene as a mixture of static and dynamic voxels and processes them with different networks. In this way, the required computation for static voxels can be handled by a lightweight model, which substantially reduces the amount of computation since many everyday dynamic scenes are dominated by static backgrounds. To distinguish the two kinds of voxels, we propose a novel variation field to estimate the temporal variance of each voxel. For the dynamic representations, we design an inner-product time query method to efficiently query multiple time steps, which is essential for recovering highly dynamic movements. As a result, with 15 minutes of training on dynamic scenes with 300-frame video inputs, MixVoxels achieves better PSNR than previous methods. For rendering, MixVoxels can render a novel-view video at 1K resolution at 37 fps.
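
The inner-product time query can be sketched as follows: each dynamic voxel stores a small feature matrix, and a learned per-frame embedding is contracted against it, so many time steps can be queried with a single matrix product. Dimensions and module names below are illustrative assumptions, not the MixVoxels code.

```python
import torch
import torch.nn as nn

class InnerProductTimeQuery(nn.Module):
    """Hypothetical inner-product time query for dynamic voxels."""
    def __init__(self, num_voxels, feat_dim=16, time_dim=8, num_frames=300):
        super().__init__()
        self.voxel_feats = nn.Parameter(
            torch.randn(num_voxels, feat_dim, time_dim) * 0.01)
        self.time_embed = nn.Embedding(num_frames, time_dim)

    def forward(self, voxel_idx, frame_idx):
        feats = self.voxel_feats[voxel_idx]          # (B, feat_dim, time_dim)
        t = self.time_embed(frame_idx)               # (B, time_dim)
        return torch.einsum('bft,bt->bf', feats, t)  # (B, feat_dim)
```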

Masked Space-Time Hash Encoding for Efficient Dynamic Scene Reconstruction  

We propose Masked Space-Time Hash encoding (MSTH), a novel method for efficiently reconstructing dynamic 3D scenes from multi-view or monocular videos. Based on the observation that dynamic scenes often contain substantial static areas, which result in redundant storage and computation, MSTH represents a dynamic scene as a weighted combination of a 3D hash encoding and a 4D hash encoding. The weights for the two components are represented by a learnable mask, guided by an uncertainty-based objective, that reflects the spatial and temporal importance of each 3D position. With this design, our method reduces the hash collision rate by avoiding redundant queries and modifications of static areas, making it feasible to represent a large number of space-time voxels with small hash tables. Moreover, without the requirement to fit a large number of temporally redundant features independently, our method is easier to optimize and converges rapidly. As a result, MSTH obtains consistently better results than previous methods with only 20 minutes of training for a 300-frame dynamic scene and 130 MB of memory.
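
In our own notation (which may differ from the paper's), the masked combination can be written as a blend of the static 3D encoding and the dynamic 4D encoding at position x and time t, weighted by a learnable mask m(x):

```latex
f(\mathbf{x}, t) = \bigl(1 - m(\mathbf{x})\bigr)\, h_{3\mathrm{D}}(\mathbf{x})
                 + m(\mathbf{x})\, h_{4\mathrm{D}}(\mathbf{x}, t),
\qquad m(\mathbf{x}) \in [0, 1].
```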

Risk-Aware Decision Making for Human–Robot Collaboration With Effective Communication 

Human-robot collaboration is crucial for integrating robots into intelligent manufacturing (IM). However, a significant challenge is rational decision-making for the human-cyber-physical system (HCPS) in IM, which must compensate for the cognitive limits of human operators and overcome their potential irrationality. Since humans dominate the collaboration in IM, it is essential to address two critical issues: determining the next task for the robot and deciding whether the human operator should be informed. We propose a risk-aware decision-making framework for task allocation and human-robot interaction (HRI) that balances the autonomy of the human operator against task efficiency. To quantify the efficiency risk, we use conditional value-at-risk (CVaR) to account for the uncertainty of human operators. We then obtain the optimal task allocation and selection for the robot by minimizing the efficiency risk. We also establish a necessary collection of tasks that must be performed by the human operator. Furthermore, we develop two criteria to quantify the necessity of explicit HRI. Experiments on a real mechanical arm platform demonstrate that our methods can enhance human-robot collaboration (HRC), reduce the need for extensive communication, and grant human operators greater execution freedom.
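
For reference, the conditional value-at-risk used above is commonly written in the Rockafellar-Uryasev form for a cost variable X at confidence level alpha (the paper's exact formulation may differ):

```latex
\mathrm{CVaR}_{\alpha}(X)
  = \min_{\tau \in \mathbb{R}} \Bigl\{ \tau + \tfrac{1}{1-\alpha}\,
      \mathbb{E}\bigl[(X - \tau)^{+}\bigr] \Bigr\}
  = \mathbb{E}\bigl[X \mid X \ge \mathrm{VaR}_{\alpha}(X)\bigr]
  \quad \text{(the last equality holds for continuous } X\text{)}.
```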

Embodied Multi-Agent Task Planning from Ambiguous Instruction

We propose an embodied multi-agent task planning framework that utilizes external knowledge sources and dynamically perceived visual information to resolve high-level instructions, and dynamically allocates the decomposed tasks to multiple agents. Furthermore, we utilize semantic information to perform environment perception and generate sub-goals for navigation. This model effectively bridges the gap between the simulation environment and the physical environment, so it can be applied in both simulation and physical scenarios and avoids the notorious sim2real problem. Finally, we build a benchmark dataset to validate the embodied multi-agent task planning problem, which includes three types of high-level instructions in which some target objects are implicit. We perform evaluation experiments on the simulation platform and in physical scenarios, demonstrating that the proposed model can achieve promising results for multi-agent collaborative tasks.

Multi-Agent Embodied Semantic Navigation

We propose multi-agent visual semantic navigation, in which multiple agents collaborate to find multiple target objects. It is a challenging task that requires agents to learn reasonable collaboration strategies to perform efficient exploration under communication-bandwidth restrictions. We develop a hierarchical decision framework based on semantic mapping, scene prior knowledge, and a communication mechanism to solve this task. The experimental results in unseen scenes, with both seen and unseen objects, illustrate the higher accuracy and efficiency of the proposed model compared with the single-agent model.

Multi-Agent Embodied Question Answering

We address the multi-agent embodied question answering problem in two stages: Multi-Agent 3D Reconstruction in Interactive Environments and Question Answering. Our proposed framework features multi-layer structural and semantic memories shared by all agents, as well as a question answering model built upon a 3D CNN to encode the scene memories. During the reconstruction, agents simultaneously explore and scan the scene with a clear division of work, organized by next-viewpoint planning. We evaluate our framework on the IQuADv1 dataset and outperform the IQA baseline in the single-agent scenario. In multi-agent scenarios, our framework shows favorable speedups while maintaining high accuracy.

Manipulation Question Answering

We propose a novel task, Manipulation Question Answering (MQA), in which the robot performs manipulation actions to change the environment in order to answer a given question. To solve this problem, a framework consisting of a QA module and a manipulation module is proposed. For the QA module, we adopt a method from the Visual Question Answering (VQA) task. For the manipulation module, a Deep Q-Network (DQN) model is designed to generate manipulation actions for the robot to interact with the environment. We consider the situation where the robot continuously manipulates objects inside a bin until the answer to the question is found. Besides, a novel dataset that contains a variety of object models, scenarios, and corresponding question-answer pairs is established in a simulation environment. Extensive experiments have been conducted to validate the effectiveness of the proposed framework.
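
A minimal sketch of the manipulation module is given below: a Q-network maps the bin image to Q-values over a discretized set of manipulation actions, selected epsilon-greedily during training. The network shape and action count are illustrative assumptions, not the paper's architecture.

```python
import random
import torch
import torch.nn as nn

class ManipulationQNet(nn.Module):
    """Hypothetical Q-network for MQA: image of the bin -> Q-values over a
    discretized grid of push/pick actions."""
    def __init__(self, num_actions=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_actions))

    def forward(self, image):                  # (B, 3, H, W)
        return self.backbone(image)

def select_action(qnet: ManipulationQNet, image: torch.Tensor,
                  epsilon: float = 0.1) -> int:
    """Epsilon-greedy action selection for exploration during training."""
    if random.random() < epsilon:
        return random.randrange(qnet.backbone[-1].out_features)
    with torch.no_grad():
        return int(qnet(image).argmax(dim=1).item())
```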

Visual-Auditory-Tactile Perception for Language Grounding

It has long been expected that robots can understand natural language instructions, enabling more natural human-robot interaction. Currently, a robot usually interprets an instruction by visually grounding the textual information to its surroundings, but visual perception alone may not be enough in some complex situations. It is therefore reasonable for the robot to leverage its multisensory perception ability to better understand the instruction. In this paper, we propose a multisensory perception approach to the task of natural language instruction understanding for robotic manipulation, in which the robot coordinates its visual, tactile, and auditory perception to fully understand the instruction and then executes the manipulation task. Extensive experiments have been conducted demonstrating the superiority of multisensory perception over single-sensory perception for instruction understanding. Moreover, we establish a user-friendly human-robot interaction interface in which the human sends instructions to the robot via a mobile app.

Auditory-Visual Grounding Referring Expression

Referring expressions are commonly used when referring to a specific target in people's daily dialogue. In this paper, we develop a novel task of audio-visual grounding of referring expressions for robotic manipulation. The robot leverages both audio and visual information to understand the referring expression in the given manipulation instruction, and the corresponding manipulations are then performed. To solve the proposed task, an audio-visual framework is proposed for visual localization and sound recognition. We have also established a dataset containing visual data, auditory data, and manipulation instructions for evaluation. Finally, extensive experiments are conducted both offline and online to verify the effectiveness of the proposed audio-visual framework. It is demonstrated that the robot performs better with audio-visual data than with visual data alone.

Embodied Semantic Scene Graph Generation

We propose a new task of Embodied Semantic Scene Graph Generation, which exploits the embodiment of the intelligent agent to autonomously generate an appropriate path to explore the environment for scene graph generation. To this end, a learning framework combining the paradigms of imitation learning and reinforcement learning is proposed to help the agent generate proper actions to explore the environment while the scene graph is incrementally constructed. The proposed method is evaluated in the AI2Thor environment using both quantitative and qualitative performance indexes. Additionally, we apply the proposed method to a streaming video captioning task, achieving promising experimental results.

Remote Embodied Visual Referring Expression in Continuous Environment

To make the REVERIE task more consistent with the real physical world, we develop a new task of Remote Embodied Visual Referring Expression in Continuous Environment, namely REVE-CE, in which the agent executes a much longer sequence of low-level actions given language instructions. Furthermore, we propose a multi-branch cross-modal attention (MBCMA) framework to solve the proposed REVE-CE task. Extensive experiments demonstrate that the proposed framework greatly outperforms state-of-the-art VLN baselines, and a new benchmark for the proposed REVE-CE task is built.

Embodied Scene Description

We propose Embodied Scene Description, which exploits the embodiment ability of the agent to find an optimal viewpoint in its environment for scene description tasks. A learning framework combining the paradigms of imitation learning and reinforcement learning is established to teach the intelligent agent to generate corresponding sensorimotor activities. The proposed framework is tested on both the AI2Thor dataset and a real-world robotic platform, demonstrating the effectiveness and extensibility of the developed method.

Adversarial Skill Learning for Robust Manipulation

Deep reinforcement learning has made significant progress in robotic manipulation tasks and works well in ideal, disturbance-free environments. However, in real-world environments both internal and external disturbances are inevitable, and the performance of the trained policy can drop dramatically. To improve the robustness of the policy, we introduce an adversarial training mechanism for robotic manipulation tasks in this paper, and an adversarial skill learning algorithm based on soft actor-critic (SAC) is proposed for robust manipulation. Extensive experiments are conducted to demonstrate that the learned policy is robust to internal and external disturbances. Additionally, the proposed algorithm is evaluated both in a simulation environment and on a real robotic platform.
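
The adversarial mechanism can be pictured as a zero-sum loop in which a protagonist policy maximizes the task reward while an adversary injects bounded disturbances to minimize it, both trained with SAC-style updates. The agent and environment interfaces below (`act`, `observe`, `update`, and an `env.step` that accepts a disturbance) are assumptions for illustration only.

```python
def adversarial_training(protagonist, adversary, env, steps=1000):
    """Hypothetical adversarial skill-learning loop: protagonist and adversary
    receive opposite rewards (zero-sum) and are updated after every step."""
    obs = env.reset()
    for _ in range(steps):
        action = protagonist.act(obs)
        disturbance = adversary.act(obs)          # e.g., a bounded external force
        next_obs, reward, done, _ = env.step(action, disturbance)
        protagonist.observe(obs, action, reward, next_obs, done)
        adversary.observe(obs, disturbance, -reward, next_obs, done)
        protagonist.update()                      # SAC-style policy/critic update
        adversary.update()
        obs = env.reset() if done else next_obs
```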

Lifelong Learning on Robotic Manipulation Tasks

Overcoming catastrophic forgetting is of great importance for deep learning and robotics. Recent lifelong learning research has made great advances in supervised learning, but little work focuses on reinforcement learning (RL). We evaluate the performance of state-of-the-art lifelong learning algorithms on robotic reinforcement learning tasks, focusing on their ability to overcome catastrophic forgetting. We summarize the pros and cons of each category of lifelong learning algorithms when applied in RL scenarios. We also propose a framework to modify supervised lifelong learning algorithms so that they are compatible with RL, and we develop a manipulation benchmark task set for our evaluations.

Continual Learning with Recursive Gradient Optimization

Learning multiple tasks sequentially without forgetting previous knowledge, called Continual Learning (CL), remains a long-standing challenge for neural networks. Most existing methods rely on additional network capacity or data replay. In contrast, we introduce a novel approach that we refer to as Recursive Gradient Optimization (RGO). RGO is composed of an iteratively updated optimizer that modifies the gradient to minimize forgetting without data replay, and a virtual Feature Encoding Layer (FEL) that represents different network structures with only task descriptors. Experiments demonstrate that RGO has significantly better performance on popular continual classification benchmarks compared to the baselines and achieves new state-of-the-art performance on 20-split-CIFAR100 (82.22%) and 20-split-miniImageNet (72.63%). With higher average accuracy than Single-Task Learning (STL), this method provides flexible and reliable continual learning capabilities for learning models that rely on gradient descent.
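
To give a flavor of gradient-modifying continual learners in this family (this is a generic sketch in the spirit of RGO, not the published algorithm), the class below maintains a recursively updated per-layer matrix from past-task activations and uses it to transform new gradients away from directions those tasks relied on.

```python
import torch

class GradientProjector:
    """Generic recursive gradient modifier (illustrative, not the exact RGO update):
    P tracks the inverse of (I + sum of past input outer products) via rank-one
    Sherman-Morrison updates and is applied to new weight gradients."""
    def __init__(self, dim: int):
        self.P = torch.eye(dim)

    def update(self, activations: torch.Tensor) -> None:
        # activations: (N, dim) layer inputs collected after finishing a task
        for x in activations:
            x = x.view(-1, 1)
            Px = self.P @ x
            self.P -= (Px @ Px.T) / (1.0 + (x.T @ Px).item())

    def modify(self, grad: torch.Tensor) -> torch.Tensor:
        # grad: (out_dim, dim) weight gradient of a linear layer
        return grad @ self.P   # suppress components along past-task input directions
```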

Self-Supervised Learning

Unsupervised learning methods based on contrastive learning have drawn increasing attention and achieved promising results. Most of them aim to learn representations invariant to instance-level variations, which are provided by different views of the same instance. In this paper, we propose Invariance Propagation to focus on learning representations invariant to category-level variations, which are provided by different instances from the same category. Our method recursively discovers semantically consistent samples residing in the same high-density regions of the representation space. We further propose a hard sampling strategy that concentrates on maximizing the agreement between the anchor sample and its hard positive samples, which provide more intra-class variations and help capture more abstract invariance. As a result, with a ResNet-50 backbone, our method achieves 71.3% top-1 accuracy on ImageNet linear classification and 78.2% top-5 accuracy when fine-tuning on only 1% of the labels, surpassing previous results. We also achieve state-of-the-art performance on other downstream tasks, including linear classification on Places205 and Pascal VOC, and transfer learning on small-scale datasets.
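
The hard-positive objective can be illustrated with a multi-positive InfoNCE-style loss: the anchor is pulled toward its propagated hard positives and pushed away from a set of negatives. The exact loss in the paper may differ; the function below is a hedged sketch assuming L2-normalized embeddings.

```python
import torch

def multi_positive_nce(anchor: torch.Tensor, positives: torch.Tensor,
                       negatives: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """Illustrative multi-positive contrastive loss.
    anchor: (D,), positives: (P, D), negatives: (K, D); all L2-normalized."""
    pos_sim = positives @ anchor / tau                      # (P,)
    neg_sim = negatives @ anchor / tau                      # (K,)
    denom = torch.logsumexp(torch.cat([pos_sim, neg_sim]), dim=0)
    return -(pos_sim - denom).mean()                        # average over positives
```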