SpeechMapper: Speech-to-text Embedding Projector for LLMs.
Biswesh Mohapatra*, Marcely Zanon Boito*, and Ioan Calapodescu.
In ICASSP 2026.
Current speech LLMs bridge speech foundation models to LLMs using projection layers, training all of these components on speech instruction data. This strategy is computationally intensive and susceptible to task and prompt overfitting. We present SpeechMapper, a cost-efficient speech-to-LLM-embedding training approach that mitigates overfitting, enabling more robust and generalizable models. Our model is first pretrained without the LLM on inexpensive hardware, and then efficiently attached to the target LLM via a brief 1K-step instruction tuning (IT) stage. Through experiments on speech translation and spoken question answering, we demonstrate the versatility of SpeechMapper's pretrained block, presenting results for both task-agnostic IT, an ASR-based adaptation strategy that does not train on the target task, and task-specific IT. In task-agnostic settings, SpeechMapper rivals the best instruction-following speech LLM from IWSLT25, despite never being trained on these tasks, while in task-specific settings, it outperforms this model across many datasets, despite requiring less data and compute. Overall, SpeechMapper offers a practical and scalable approach for efficient, generalizable speech-LLM integration without large-scale IT.
Frame of Reference: Addressing the Challenges of Common Ground Representation in Situational Dialogs.
Biswesh Mohapatra, Théo Charlot, Giovanni Duca, Mayank Palan, Laurent Romary, and Justine Cassell.
In Findings of ACL 2026.
Common ground plays a critical role in situated spoken dialogs, where interlocutors must establish and maintain shared references to entities, events, and relations to sustain coherent interaction in a shared space and over time. With the increasing presence of embodied conversational agents and social robots, the ability to correctly ground this kind of conversational content in order to refer back to it later also becomes important for dialog systems. Prior studies have demonstrated that LLMs are capable of performing certain grounding acts, such as acknowledgments. However, relatively little work has investigated their capacity to leverage the grounded information, particularly in complex scenarios involving space and time (e.g., "let's go to that café near the park we went to yesterday"). To that end, in this work, we evaluate a model's ability to establish common ground by utilizing these "relational references" in the dynamic and shared environments of situated dialogs. We then test multiple methods for representing common ground and further propose approaches to improve their performance by applying reinforcement learning to our synthetically generated dialog data.
Evaluating the Effectiveness of Large Language Models in Establishing Conversational Grounding.
Biswesh Mohapatra, Manav Nitin Kapadnis, Laurent Romary, and Justine Cassell.
In Proceedings of EMNLP 2024.
Conversational grounding, vital for building dependable dialog systems, involves ensuring a mutual understanding of shared information. Despite its importance, there has been limited research on this aspect of conversation in recent years, especially after the advent of Large Language Models (LLMs). Previous studies have highlighted the shortcomings of pre-trained language models in conversational grounding. However, most testing for conversational grounding capabilities involves human evaluations that are costly and time-consuming. This has led to a lack of testing across multiple models of varying sizes, a critical need given the rapid rate of new model releases. This gap in research becomes more significant considering recent advances in language models, which have led to new emergent capabilities. In this paper, we aim to evaluate the performance of LLMs in various aspects of conversational grounding and analyze why some models perform better than others. We demonstrate a direct correlation between the size of the pre-training data and conversational grounding abilities, suggesting that models trained on larger pre-training datasets independently acquire a specific form of pragmatic capability. Finally, we propose ways to enhance the capabilities of the models that lag in this aspect.
Conversational Grounding: Annotation and Analysis of Grounding Acts and Grounding Units.
Biswesh Mohapatra, Seemab Hassan, Laurent Romary, and Justine Cassell.
In Proceedings of LREC-COLING 2024.
Successful conversations often rest on common understanding, where all parties are on the same page about the information being shared. This process, known as conversational grounding, is crucial for building trustworthy dialog systems that can accurately keep track of and recall the shared information. An agent's proficiency in grounding the conveyed information contributes significantly to building a reliable dialog system. Despite recent advancements in dialog systems, there exists a noticeable deficit in their grounding capabilities. Traum (1995) provided a framework for conversational grounding, introducing Grounding Acts and Grounding Units, but substantial progress, especially in the realm of Large Language Models, remains lacking. To bridge this gap, we present the annotation of two dialog corpora employing Grounding Acts, Grounding Units, and a measure of their degree of grounding. We discuss our key findings during the annotation and also provide a baseline model to test the performance of current Language Models in categorizing the grounding acts of the dialogs. Our work aims to provide a useful resource for further research in making conversations with machines better understood and more reliable in natural day-to-day collaborative dialogs.
Simulated Chats for Building Dialog Systems: Learning to Generate Conversations from Instructions.
Biswesh Mohapatra, Gaurav Pandey, Danish Contractor, and Sachindra Joshi.
In Findings of EMNLP 2021.
Popular dialog datasets such as MultiWOZ are created by providing crowd workers an instruction, expressed in natural language, that describes the task to be accomplished. Crowd workers play the roles of a user and an agent to generate dialogs that accomplish tasks such as booking restaurant tables or calling a taxi. In this paper, we present a data creation strategy that uses the pre-trained language model GPT-2 to simulate the interaction between crowd workers by creating a user bot and an agent bot. We train the simulators using a small percentage of actual crowd-generated conversations and their corresponding instructions. We demonstrate that by using the simulated data, we achieve significant improvements in low-resource settings on two publicly available datasets: MultiWOZ and Persona-Chat.
Why Settle for Just One? Extending EL++ Ontology Embeddings with Many-to-Many Relationships.
Biswesh Mohapatra, Sumit Bhatia, Raghava Mutharaju, and G. Srinivasaraghavan.
Winner of the SEMREC challenge, ISWC 2021.
Knowledge Graph (KG) embeddings provide a low-dimensional representation of the entities and relations of a Knowledge Graph and are used successfully in applications such as question answering and search, reasoning, inference, and missing link prediction. However, most existing KG embeddings consider only the network structure of the graph and ignore the semantics and characteristics of the underlying ontology, which provides crucial information about relationships between entities in the KG. Recent efforts in this direction involve learning embeddings for a Description Logic (the logical underpinning of ontologies) named EL++. However, such methods treat all relations defined in the ontology as one-to-one, which severely limits their performance and applications. We provide a simple and effective solution to overcome this shortcoming, allowing such methods to consider many-to-many relationships while learning embedding representations. Experiments conducted on three different EL++ ontologies show substantial performance improvement over five baselines. Our proposed solution also paves the way for learning embedding representations of even more expressive description logics such as SROIQ.
Incorporating Autonomous Bargaining Capabilities into E-Commerce Systems.
Ananth Shreekumar*, Biswesh Mohapatra*, and Shrisha Rao.
In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (IVA ’20).
Bundling is a technique e-commerce companies have adopted from traditional retail stores to increase the average order size. Bargaining has been observed to increase customer satisfaction while raising the average order revenue for retailers. We propose a mathematical framework that adds bargaining capabilities to the product bundles offered by e-commerce websites. Our method creates a virtual agent that uses the modular Bidding-Opponent-Acceptance model for its bargaining strategy and the Thomas-Kilmann conflict mode instrument to model buyer behavior. The negotiation agent incorporates business logic into its strategy and relies on real-time data generated during a negotiation session, since buyer behavior during a negotiation is crucial. It requires no data from the buyer's past negotiation sessions, which removes bias and allows the agent to adapt quickly to changes in buyer behavior. The agent's behavior can be tuned through various hyperparameters, and our model provides utility metrics to measure buyer and agent satisfaction. Our results show that the agent successfully negotiates with humans from diverse backgrounds.