SIL (Symbiotic Interactive Learning) is a framework for co-adaptive human–robot interaction (HRI). State-of-the-art language-conditioned HRI frameworks perpetuate a master–apprentice model in which the apprentice (the embodied agent) passively receives and executes the master's (the human's) commands without reciprocal learning. This one-way, reactive interaction does not capture the co-adaptive dynamics inherent in everyday multi-turn human–human interaction. SIL reimagines human–agent interaction as a dynamic, co-adaptive process: rather than treating the human as a fixed command source and the agent as a passive executor, SIL models both as adaptive systems that jointly maintain and align their belief states within a shared latent task space, enabling bidirectional co-adaptation akin to natural human–human interaction.
Formalisation: Co-adaptation modelled as belief-state evolution over interaction history.
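A minimal sketch of this formalisation, assuming belief states and utterances are embeddings in the same shared latent space; the update rates `alpha` and `beta` are illustrative assumptions, not the paper's exact update rule:

```python
import numpy as np

def co_adaptive_update(agent_belief, human_belief, utterance_embedding,
                       alpha=0.3, beta=0.1):
    """One co-adaptive step over the interaction history.

    The agent pulls its belief toward the evidence carried by the latest
    utterance, and the (estimated) human belief drifts toward the agent's,
    e.g. after a clarification or proactive suggestion, so the update is
    bidirectional rather than one-way.
    """
    agent_belief = (1 - alpha) * agent_belief + alpha * utterance_embedding
    human_belief = (1 - beta) * human_belief + beta * agent_belief
    return agent_belief, human_belief

# Toy usage: beliefs evolve turn by turn over a short interaction history.
agent_b, human_b = np.zeros(128), np.random.randn(128)
for utterance in (np.random.randn(128) for _ in range(5)):
    agent_b, human_b = co_adaptive_update(agent_b, human_b, utterance)
```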
Capabilities and Features: Anti-forgetting, adaptive learning, proactive suggestions, shared plan refinement, bidirectional interaction, voice recognition, etc.
Architecture (a minimal sketch follows the list below):
Foundation models (FMs) for spatial perception & reasoning.
Lightweight latent encoder for task-specific grounding.
Memory and continual learning safeguards protect against catastrophic forgetting of learned task representations.
Uncertainty-aware language estimation to ensure robust intent recognition and prevent unsafe execution of ambiguous instructions.
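As an illustration of how these components might fit together, here is a minimal sketch (not the released SIL code); the encoder dimensions, the EWC weight `lam`, and the entropy threshold are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LatentTaskEncoder(nn.Module):
    """Lightweight encoder that grounds frozen FM features in a task-specific latent space."""
    def __init__(self, fm_dim=768, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(fm_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))

    def forward(self, fm_features):
        return self.net(fm_features)

def ewc_penalty(model, fisher, old_params, lam=100.0):
    """Elastic Weight Consolidation: penalise drift on weights important to earlier tasks."""
    loss = torch.zeros(())
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam * loss

def should_clarify(intent_probs, entropy_threshold=1.0):
    """Uncertainty-aware gate: request clarification instead of executing when intent entropy is high."""
    entropy = -(intent_probs * intent_probs.clamp_min(1e-9).log()).sum()
    return entropy.item() > entropy_threshold
```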
We conducted experiments in both simulated and real-world environments to validate SIL, focusing on its co-adaptive mechanisms for belief alignment, memory retention, and preference learning. We evaluated SIL across five key dimensions: (i) instruction execution under ambiguity and temporal complexity, (ii) long-term memory and retention, (iii) contextual reasoning, (iv) clarification and proactive dialogue, and (v) preference-based personalisation.
Quantitatively, we evaluated SIL with the following metrics: (i) Task Completion Rate (TCR): the percentage of correctly executed tasks. (ii) Belief Alignment (BA): the weighted cosine similarity between human and agent belief embeddings. (iii) Clarification Efficiency (CE): the average number of clarification requests per successful task.
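A minimal sketch of these metrics, assuming belief embeddings are NumPy vectors; the helper names and the optional per-turn weights are illustrative, not the paper's exact evaluation code:

```python
import numpy as np

def task_completion_rate(successes, attempts):
    """TCR: percentage of correctly executed tasks."""
    return 100.0 * successes / attempts

def belief_alignment(human_beliefs, agent_beliefs, weights=None):
    """BA: weighted cosine similarity between human and agent belief embeddings."""
    sims = [np.dot(h, a) / (np.linalg.norm(h) * np.linalg.norm(a))
            for h, a in zip(human_beliefs, agent_beliefs)]
    weights = np.ones(len(sims)) if weights is None else np.asarray(weights)
    return float(np.average(sims, weights=weights))

def clarification_efficiency(n_clarifications, n_successful_tasks):
    """CE: average number of clarification requests per successful task."""
    return n_clarifications / n_successful_tasks
```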
Performance evaluation of SIL across different task domains. Tasks include: Embodied Instruction Following (EIF), Memory-Based Interactive Information Retrieval (MIIR), Query-Oriented Reasoning (QOR), Proactive Dialogue and Suggestion (PDS), and Long-Term Preference Learning (LPL).
Task success rate across domains and ablated variants. Full SIL: All components included. w/o Co-Adaptation: SIL without bidirectional belief updating. w/o EWC: SIL without continual learning. w/o Human Pref.: SIL without preference mechanism. w/o Memory: SIL without memory. w/o Uncertainty: SIL without uncertainty quantification.
Belief alignment (BA) across multi-turn interactions. Full SIL (blue) exhibits rapid convergence toward a stable equilibrium (BA ≈ 0.83), maintaining high alignment throughout. In contrast, ablations without co-adaptation, EWC, human preference modelling, memory, or uncertainty handling exhibit unstable trajectories (BA ≈ 0.52–0.65) and fail to achieve strong alignment.
Ablation study on SIL's core architecture. Metrics are averaged across all task categories. Ablating the co-adaptation mechanism causes the largest performance drop, reducing performance to the level of the static LLM baseline.
Here, we show qualitative examples of SIL in multi-turn interaction tasks. Yellow paths indicate the agent’s navigation trajectories. The user issues a series of commands (Int 1–Int 14) that require logical reasoning over spatial constraints, conditional navigation, anti-forgetting, preference retention, and continual learning.
Int 1: The user issues a multi-step navigation command (Navigate between the passageway and the location (2, 1, 0), and return to the current location).
Int 2: The user probes episodic recall (What task did you just perform?).
Int 3: The user requests a task replay (Repeat Int 1).
Int 4: The user issues a constraint-based navigation command (If the round trip between the coordinates (2, 2) and (-3, -1) would take more than 20s, navigate between them twice; otherwise, make a circle of radius 0.75m at the current location).
Int 5: The user requests a replay of the unmet condition from Int 4 (Move between the coordinates twice).
Int 6: The user directs the SIL agent back to the starting point and requests a scene report (Return to the origin and describe what you can see).
Int 7: The user queries for an object and issues a navigation command to the possible object location (Where can I find a spoon? Take me to the location).
Int 8: The user shifts intent and asks for a location for academic activities (I want to make professional academic inquiries. Take me to the possible location).
Int 9: The user issues a command with an ambiguous reference (Head there and return here quickly).
Int 10: The user rejects the SIL agent’s proactive suggestions from Int 9 and reframes the intent (No, I mean to the location where I can relax and enjoy nature).
Int 11: The user introduces a new preference as a persistent alias (Patrol mode means head to the Prof. and Sec. office and send photos of what you can see).
Int 12: The user issues a distractor task (Go to the hallway and return here).
Int 13: The user invokes the alias (Int 11) after the distractor task in Int 12 (Now patrol mode).
Int 14: The user commands the SIL agent to return to the initial point and make a geometric move (Return to the origin and make a circle of 0.5m radius).
If you use this work in your research, please cite it using the following BibTeX entry:
@inproceedings{xxxxxxxxx,
  author={author1 and author2 and author3 and author4 and author5},
  booktitle={International Conference on xxxxxxxx},
  title={Beyond Master and Apprentice: Grounding Foundation Models for Symbiotic Interactive Learning in a Shared Latent Space},
  year={2026},
  volume={},
  number={},
  pages={xxx-xxx},
  doi={xxxxxx}
}
This work received funding from the xxxxxx (dddd, rrrrrrr) under grant No. #dddddddd (rrrrrrr).