🔍Conan: Active Reasoning in an Open-World Environment

Manjie Xu1, Guangyuan Jiang2, Wei Liang1, 3, ✉️, Chi Zhang4, ✉️, Yixin Zhu2, ✉️

1School of Computer Science & Technology, Beijing Institute of Technology; 2Institute for AI, Peking University; 3Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing, China; 4National Key Laboratory of General Artificial Intelligence
NeurIPS 2023
arXiv, Code, Paper

Motivation

Active interaction with the environment is fundamental to human understanding of the world around us. Both neural and behavioral studies indicate that through active engagement with their surroundings, humans garner critical insights and foster a profound understanding of complex phenomena. When confronted with partial or ambiguous data, our innate response is to seek supplementary evidence, hypothesize, and put forth possible explanations, sometimes even reevaluating initial assumptions. This iterative process persists until a satisfactory resolution emerges.


We present 🔍Conan to capture the dynamic and exploratory essence of abductive reasoning—termed herein as active reasoning.

🔍Conan

🔍Conan is a new open-world environment tailored for abductive reasoning. Unlike traditional single-round passive reasoning benchmarks, 🔍Conan offers an open-world arena, urging agents to actively probe their surroundings and engage in multi-round abductive inference, leveraging in-situ collected evidence alongside pre-existing knowledge.

At its core, 🔍Conan is conceived as a detective game, transmuted into a question-answering challenge. Here, the detective is tasked with a query and an “incident scene” riddled with traces left by a vandal. Given the initial paucity of conclusive information, the detective must embark on an in-depth exploration of the scene. As the inquiry progresses, the detective has the opportunity to actively scout its environment, continually reshaping and honing its hypotheses, especially when new revelations potentially contradict the prior hypothesis. We meticulously craft questions within 🔍Conan to span various levels of abstraction, from localized intentions (Intent) to overarching objectives (Goal) and survival states (Survival).

Playground

🔍Conan offers an extensive assortment of interactive items: food, materials, mobs, and tools, each tied to specific actions, as illustrated. It furnishes 26 unique actions to foster agent-environment engagement. Certain actions leave traces, and together, the items and their mechanics provide a rich set of affordances for agents in the playground. This knowledge about item operations and traces aids the detective in comprehending the incident scene. Advancing from its predecessor, the original Crafter, 🔍Conan now boasts over 30 achievements. It features 32 distinct traces covering all agent actions such as crafting, collecting, defeating, eating, drinking, and incurring injuries. This enhancement enables the design of 60 varied abductive reasoning tasks within the scene. 
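The action-trace mechanic above can be sketched as a simple lookup from actions to the traces they leave. This is a hypothetical illustration: the action and trace names below are invented for the example and are not 🔍Conan's actual API.

```python
# Hypothetical sketch of how Conan-style actions could map to traces.
# The names ("collect_wood", "tree_stump", ...) are illustrative assumptions.

ACTION_TRACES = {
    "collect_wood": "tree_stump",
    "craft_pickaxe": "sawdust_near_table",
    "eat_cow": "bones",
    "drink_water": None,  # some actions leave no trace
}

def traces_left(actions):
    """Return the traces a sequence of actions would leave in the scene."""
    return [t for a in actions if (t := ACTION_TRACES.get(a)) is not None]

print(traces_left(["collect_wood", "drink_water", "eat_cow"]))
# ['tree_stump', 'bones']
```

Because traces are a many-to-one projection of actions, the detective must invert this mapping under uncertainty, which is what makes the abduction nontrivial.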

Questions

🔍Conan is designed to assess the abductive reasoning capability of machine models through a diverse set of questions varying in difficulty and abstraction. These questions fall into three primary categories: Intent (local intent), Goal (global goal), and Survival (agent’s survival status change). 


Intent questions target the vandal’s immediate objectives or intentions during its task. To decipher these traces, agents must deduce the vandal’s underlying intent or subgoals. Solving these questions necessitates a learning model’s comprehension of the local context.

E.g. What did the vandal make on this table?


Goal questions probe the vandal’s overarching objectives, extending beyond immediate intents. They necessitate grasping the wider context of a task or action sequence. Such questions query the vandal’s ultimate aims, demanding a learning model to reason within the broader context of the traces.

E.g. What was the vandal’s primary objective in this scenario?


Survival questions address the widest investigative scope, posing added challenges to the detective. They center on changes in the vandal’s survival status during tasks (e.g., collecting food for sustenance), which cause deviations from the optimal action plan. These questions require a deeper grasp of the present context, often necessitating reasoning about potential scenarios or alternate outcomes.

E.g. Why did the vandal die in this situation?
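The three question categories can be represented as multi-choice records. The schema below is a hypothetical sketch of what such a record might look like; the field names and choices are assumptions for illustration, not the benchmark's actual data format.

```python
# Hypothetical multi-choice record for a Conan-style question.
# Field names and example choices are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ConanQuestion:
    category: str      # "Intent", "Goal", or "Survival"
    question: str
    choices: list      # candidate answers
    answer_idx: int    # index of the correct choice

q = ConanQuestion(
    category="Survival",
    question="Why did the vandal die in this situation?",
    choices=["killed by a zombie", "starved", "drowned", "fell into lava"],
    answer_idx=1,
)

def is_correct(question, choice_idx):
    """Check a predicted choice against the ground-truth answer."""
    return choice_idx == question.answer_idx

print(is_correct(q, 1))  # True
```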

The Detective

🔍Conan casts the abductive reasoning challenge as a detective game, necessitating a detective to efficiently explore and gather information from the environment to deduce plausible explanations (i.e., answers) for the given question. This process involves taking into account the temporal dependencies and incompleteness of the traces. To tackle these challenges encountered in 🔍Conan, we devise a detective pipeline.

Building on previous work that utilizes hierarchical models for task decomposition, our pipeline is structured into two primary phases: an exploration phase for trace collection, followed by an abductive reasoning phase. Initially, interaction with the playground is carried out to collect relevant visual information, which is subsequently leveraged in the reasoning phase to infer answers to the posed questions.

Computationally, our pipeline first employs reinforcement learning agents as explorers that learn an exploration policy conditioned on the traces and the question, thereby rendering it goal-oriented. Next, given the question, we recruit vision-language models to predict the answer based on the observation. A key-frame extractor is inserted between the two phases to selectively identify relevant frames for abduction. The individual components undergo separate training procedures.
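The three-stage flow above (explore, extract key frames, reason) can be sketched end to end. This is a minimal structural sketch with stubbed placeholder bodies; the real pipeline uses a learned RL policy, a trained key-frame extractor, and a vision-language model rather than the toy functions here.

```python
# Minimal sketch of the two-phase detective pipeline:
# explore -> extract key frames -> reason. All bodies are stubs.

def explore(env_frames, question, max_steps=100):
    """Stand-in for the RL explorer: collect up to max_steps observations."""
    return env_frames[:max_steps]

def extract_key_frames(frames, k=4):
    """Stand-in for the key-frame extractor: keep k evenly spaced frames."""
    if len(frames) <= k:
        return frames
    stride = len(frames) // k
    return frames[::stride][:k]

def reason(key_frames, question, choices):
    """Stand-in for the vision-language reasoner: score each choice."""
    scores = [len(c) for c in choices]  # placeholder scoring
    return max(range(len(choices)), key=lambda i: scores[i])

frames = [f"frame_{i}" for i in range(20)]
kf = extract_key_frames(explore(frames, "What was the vandal's goal?"), k=4)
print(kf)  # ['frame_0', 'frame_5', 'frame_10', 'frame_15']
```

Training each stage separately, as the paper does, means each stub above corresponds to an independently optimized module with its own objective.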

The Explorer

We compare DQN, TRPO, and RecurrentPPO as explorers. TRPO and RecurrentPPO achieve similar rewards after a substantial number of steps, markedly surpassing the DQN explorer.

Additionally, we probe the impact of raising the maximum number of exploration steps to 5,000. The data suggest a marginal performance uplift. Nonetheless, we acknowledge that this increment comes at the expense of substantially longer exploration time and a notable surge in the accrual of unrelated information.

The Reasoner

We employ a multi-choice question-answering paradigm to solve 🔍Conan. Specifically, the model is presented with a question, its corresponding exploration frame sequence, and each potential answer choice, subsequently generating a score for each choice. The model is trained with a categorical cross-entropy loss. During inference, the choice with the highest score is considered the answer. We evaluate several well-established multimodal models: Vanilla-Trans, FrozenBiLM, and Flamingo-Mini. Our reasoning models are tested under three different settings: Standard, Ideal Explorer, and AfD. Quantitative results on 🔍Conan are depicted as follows.
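The multi-choice scoring scheme can be made concrete: the model emits one score per choice, scores are normalized with softmax and trained with categorical cross-entropy, and inference takes the argmax. The sketch below uses dummy scores standing in for model outputs; it illustrates the objective, not any of the evaluated models.

```python
# Multi-choice QA objective: softmax over per-choice scores,
# categorical cross-entropy for training, argmax at inference.
# The score values are dummy stand-ins for model outputs.

import math

def softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(scores, target_idx):
    """Categorical cross-entropy loss over the per-choice scores."""
    return -math.log(softmax(scores)[target_idx])

def predict(scores):
    """Inference: the choice with the highest score is the answer."""
    return max(range(len(scores)), key=lambda i: scores[i])

scores = [0.2, 1.5, -0.3, 0.9]  # one score per answer choice
print(predict(scores))  # 1
```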

In the standard setting, we discern that while models exhibit some aptitude in tackling low-level Intent questions, they struggle with higher-level questions pertaining to Goal and Survival. Among the models, Flamingo-Mini performs best. FrozenBiLM models also perform relatively well. The DeBERTa variant slightly outperforms BERT, suggesting that a robust language backbone can improve general comprehension.

With the Ideal Explorer, we notice a clear performance boost across all tasks, particularly on the Goal and Survival questions. These results suggest that the models’ abductive reasoning capability may be bottlenecked by insufficient collected information, underscoring the significance of effective exploration. Remarkably, Vanilla-Trans exhibits the greatest increase, suggesting that, compared to the other baselines, it is markedly vulnerable to insufficient evidence.

For AfD, nearly all multimodal models perform on par with end-to-end supervised models. Remarkably, FrozenBiLM models even surpass their performance in the standard setting. Examining task-specific results, almost all models show a notable uplift on the Survival task relative to the standard setting, despite sharing the same observations. These results suggest that the inclusion of deductive information sensitizes the detective to the vandal’s concerns during task execution.


Citation

If you find 🔍Conan useful, please cite us:


@inproceedings{xu2023conan,
  title={Conan: Active Reasoning in an Open-World Environment},
  author={Xu, Manjie and Jiang, Guangyuan and Liang, Wei and Zhang, Chi and Zhu, Yixin},
  booktitle={NeurIPS},
  year={2023}
}