Proceedings

Filippo Cavallo, Erika Rovini, Laura Fiorini - Department of Industrial Engineering, University of Florence, Italy

Enhancing Aging in Place Through Telepresence Robotics: Evidence from a 12-Month Real-World Study

Population aging and increasing social isolation among older adults demand scalable and effective solutions to support aging in place. Telepresence robots have shown promise in enhancing social connectedness and enabling remote caregiving; however, evidence from long-term real-world deployments remains limited. This paper presents a 12-month field study evaluating a telepresence robotic service integrated into domiciliary care for frail older adults. Its impact on quality of life, loneliness, resilience, and perceived social support was assessed and compared with tablet-based communication, SmartTV-based systems, and standard care. Quantitative results, complemented by multi-stakeholder qualitative insights, indicate that the telepresence robot achieved the most consistent benefits, including improvements in quality of life (+40%), perceived social support (+11.1%), and reduced loneliness (−3.13%). Caregivers reported enhanced flexibility in care delivery and a strong willingness to continue using the service. These findings suggest that telepresence robots can represent an effective and scalable solution to enhance social inclusion and support care processes, particularly when tailored to users’ characteristics and care contexts.

David Scott Lewis, Enrique Zueco

Telepresence systems increasingly combine robotics, shared autonomy, XR, and multimodal feedback, yet most controllers still optimize task completion, safety, or latency while treating presence, agency, workload, and discomfort as offline evaluation targets. The literature already shows that interface design, avatar behavior, haptics, viewpoint control, and autonomy allocation materially change the quality of embodied remote collaboration. What is still missing is a closed loop that makes those human-centered quantities control variables rather than post-hoc metrics. We propose Embodied Presence Loops (EPL), a framework for online self-calibration of telepresence systems. EPL estimates a latent user–system state from interaction traces, proxemic events, network quality, optional physiological streams, and sparse subjective probes, then adapts feedback modalities, viewpoint assistance, and autonomy authority online to preserve agency without sacrificing task performance. The paper contributes (1) a multimodal state-space formalization over presence, agency, workload, discomfort, link quality, and task risk, (2) an agency-constrained controller for modality routing and shared-autonomy scheduling, (3) a benchmark protocol spanning remote collaboration, hazardous inspection, collaborative locomotion, and assistive telepresence, and (4) a control-in-the-loop simulation illustrating the framework’s adaptive behavior under link degradation. Unlike policy-repair approaches that modify the executable task graph, EPL self-calibrates the human-facing interface itself.

Embodied Presence Loops: Multimodal Self-Calibration for Agency-Aware Telepresence

David Scott Lewis, Saien Deng, Xiping Gong, Karl Wang, Enrique Zueco

PATCH-BT: Agency-Preserving Behavior-Tree Self-Repair for Shared-Autonomy Telepresence

Shared-autonomy telepresence robots must recover from blocked paths, stale localization, intent shifts, incomplete handovers, and unsafe social approaches without eroding the operator’s sense of agency. Existing embodied self-correction systems usually retry actions or replan remaining suffixes, but they rarely persist minimal repairs in the executable policy itself. We present PATCH-BT, a training-free behavior-tree self-repair layer that operationalizes embodied recursive self-improvement (RSI) at test time. PATCH-BT continuously monitors task progress, safety, human-state, and link-quality signals; localizes a failure frontier in the behavior tree (BT); selects the smallest safe typed tree edit under an agency-preservation penalty; and stores successful repairs as abstract motifs for future compilation. We evaluate PATCH-BT in a timing-aware control-in-the-loop telepresence simulator with dynamic obstacles, moving humans, ambiguous intent, and 120–350 ms latency across remote fetch, cooperative handover, and remote inspection. Over 20 seeds × 180 episodes, PATCH-BT reaches 92.0% success versus 78.1% for Local-Patch and 67.5% for Replan, while reducing overrides and improving an explicit agency proxy (0.84 vs. 0.61 and 0.39). Ablations show that both memory and proactive compilation matter. These results suggest that RSI fits telepresence best as sparse, auditable policy repair rather than open-ended autonomy.
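The edit-selection step described in the abstract (smallest safe typed tree edit under an agency-preservation penalty) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the PATCH-BT implementation: the `CandidateEdit` fields, the additive cost `size + agency_weight * agency_cost`, and the default weight are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CandidateEdit:
    """Hypothetical description of one candidate behavior-tree edit."""
    name: str
    size: int           # number of BT nodes the edit touches (assumed measure)
    safe: bool          # result of an assumed safety check
    agency_cost: float  # 0.0 means no loss of operator agency (assumed scale)


def select_edit(candidates: Sequence[CandidateEdit],
                agency_weight: float = 2.0) -> Optional[CandidateEdit]:
    """Return the cheapest safe edit, or None when no candidate is safe.

    Cost is an assumed additive form: edit size plus a weighted agency penalty.
    """
    safe = [c for c in candidates if c.safe]
    if not safe:
        return None
    return min(safe, key=lambda c: c.size + agency_weight * c.agency_cost)


candidates = [
    CandidateEdit("retry_action", size=1, safe=False, agency_cost=0.0),
    CandidateEdit("insert_wait_node", size=2, safe=True, agency_cost=0.1),
    CandidateEdit("replace_subtree", size=5, safe=True, agency_cost=0.0),
    CandidateEdit("robot_takes_over", size=1, safe=True, agency_cost=0.9),
]
chosen = select_edit(candidates)
```

With these illustrative numbers, the small takeover edit loses to the slightly larger wait-node edit because of its agency penalty, which is the trade-off the penalty term is meant to express.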

Muxin Liu, Mingxuan Li, Kenneth Shaw, Deepak Pathak

Large Vision Models trained on internet-scale data have demonstrated strong capabilities in segmenting and semantically understanding object parts, even in cluttered scenes. However, while these models can direct a robot toward the general region of an object, they lack the geometric understanding required to precisely control dexterous robotic hands for 3D grasping. To overcome this, our key insight is to leverage simulation with a force-closure grasping generation pipeline that understands local geometries of the hand and object in the scene. Because this pipeline is slow and requires ground-truth observations, the generated dataset is distilled into a diffusion model that can operate on camera point clouds. By combining the global semantic understanding of internet-scale models with the geometric precision of a simulation-based locally-aware force-closure, IFG achieves high-performance semantic grasping without any manually collected training data.

IFG: Internet-Scale Guidance for Functional Grasping Generation

Chandra Mohan Singh Negi, Gaurav Harit - IIT Jodhpur & Avinash Kumar - TU Munich

Shared Autonomy for Industrial Manipulation: A Vision-Language-Action Framework for Adaptive Teleoperation

Industrial teleoperation demands both precision and flexibility, yet existing systems struggle to balance autonomous efficiency with human expertise. We present Discrete Hybrid VLA, a vision-language-action (VLA) framework that enables confidence-gated graduated autonomy for remote industrial manipulation tasks. Our architecture couples a Florence-2-Base visual encoder (238 M parameters) with a T5-Small language encoder (60 M) and a vector-quantized (VQ) action codebook (K=256), explicitly designed to satisfy a 50 ms hard inference budget on NVIDIA Jetson Orin NX. The VQ discrete action tokenization reformulates manipulation action prediction as single-pass classification, reducing action head inference from O(d) sequential operations to O(1) forward passes. Through structured attention-head pruning, INT8 mixed-precision post-training quantization (PTQ), and TensorRT static-shape compilation, we achieve 47 ms end-to-end latency (a 6.6× speedup over the FP32 baseline) while attaining 85.9% task success on the LIBERO benchmark. Industrial validation on battery cable assembly and programmable logic controller (PLC) maintenance with industrial technicians demonstrates 94% task success, 62% fewer operator interventions versus rule-based baselines, and 44% reduction in cognitive load (NASA Task Load Index, NASA-TLX: 38 vs. 68 for manual control). All failures were recoverable through human intervention with zero safety incidents.
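As a rough illustration of how a VQ action codebook turns action prediction into a single classification step, and how the resulting confidence could gate autonomy, consider the sketch below. Only K=256 comes from the abstract. The action dimension, the random stand-in codebook, the three autonomy modes, and both thresholds are assumptions for illustration and are not taken from the paper.

```python
import math
import random

K = 256          # codebook size stated in the abstract
ACTION_DIM = 7   # assumed, e.g. a 6-DoF end-effector delta plus gripper

# Random stand-in for a learned VQ codebook (assumption for illustration).
_rng = random.Random(0)
CODEBOOK = [[_rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)] for _ in range(K)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_action(logits, codebook=CODEBOOK):
    """One classification over K entries selects the whole action vector."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return codebook[idx], probs[idx], idx


def autonomy_level(confidence, hi=0.8, lo=0.5):
    """Map confidence to a mode; thresholds and mode names are hypothetical."""
    if confidence >= hi:
        return "autonomous"
    if confidence >= lo:
        return "assisted"
    return "manual"
```

The point of the sketch is that the full action is recovered by one table lookup after one softmax, rather than by predicting each action dimension in sequence.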

Nurtdinov Damir, Alexei Kornaev, Alexander Maloletov

Adapting a pretrained vision-language-action (VLA) policy to a new robot typically assumes the availability of embodiment-specific demonstrations. This assumption is restrictive for custom telepresence systems and other low-data embodiments, where collecting task demonstrations is expensive before even basic autonomous behavior exists. We present an RL bootstrapping approach for zero-demo embodiment alignment: before task-level imitation or manipulation training, a pretrained VLA backbone is adapted in simulation through reinforcement learning on simple language-conditioned motion primitives. The key idea is to use simulator-side state access to construct dense rewards for primitive commands such as move up, move left, and move center, allowing the policy to first learn how to express language-conditioned motion through a previously unseen control interface. We instantiate this approach on a cable-driven parallel robot (CDPR) telepresence platform. The main contribution is the RL bootstrapping recipe itself: an embodiment-first training stack, a primitive-instruction curriculum, and an evaluation protocol for measuring instruction-level success before scaling to object-centric tasks.

RL Bootstrapping of Language-Conditioned Control for a Novel Robot Embodiment
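The dense rewards for primitive commands mentioned in the abstract above could look roughly like the following sketch. It assumes a 2D workspace plane with x growing to the right, y growing upward, and the workspace center at the origin. The reward shapes (signed progress along a direction, or reduction in distance to the center) are one plausible choice, not the authors' reward design, and `primitive_reward` is a hypothetical name.

```python
import math

# Assumed axis convention: x grows right, y grows up, center at the origin.
DIRECTIONS = {
    "move up": (0.0, 1.0),
    "move left": (-1.0, 0.0),
}
CENTER = (0.0, 0.0)


def primitive_reward(command, prev_pos, pos):
    """Per-step dense reward computed from simulator-side positions.

    Directional commands: signed progress along the commanded unit direction.
    "move center": how much closer the platform got to the workspace center.
    """
    if command == "move center":
        return math.dist(prev_pos, CENTER) - math.dist(pos, CENTER)
    if command in DIRECTIONS:
        dx, dy = pos[0] - prev_pos[0], pos[1] - prev_pos[1]
        ux, uy = DIRECTIONS[command]
        return dx * ux + dy * uy
    raise ValueError(f"unknown primitive: {command!r}")
```

Because the reward is computed from ground-truth simulator state at every step, the policy gets a learning signal even when it has no prior notion of how its outputs map to motion on the new embodiment.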

Riccardo Kristen Simi, Samuele Bordini, Giorgio Grioli, Lucia Pallottino - Università di Pisa & Yassine El Houm, Maide Bucolo - Università di Catania

BCI-Based Supervisory Control with the Alter-Ego Assistive Humanoid in Domestic Settings

A Brain-Computer Interface (BCI) is a system that provides a direct communication channel between brain neural activity and external systems, offering a transformative assistive technology for individuals with motor impairments. However, among non-invasive paradigms, the number of reliable control signals is inherently limited, making it difficult to translate user intentions into a wide range of complex actions, especially in everyday real-world scenarios. In this work, we propose a comprehensive BCI framework that integrates a robust classification pipeline with a Finite State Machine interface, enabling the translation of only two motor imagery commands, i.e., imagining left- or right-hand movement, into high-level actions. The system allows the user to control the Alter-Ego robot as a physical avatar to enable navigation and physical interaction with the real-world environment. Experimental validation across seven subjects and multiple training scenarios demonstrates classification accuracy of up to nearly 97% and robust generalization, enabling reliable and intuitive real-time teleoperation in both local and remote settings. These results highlight the effectiveness of the proposed approach in bridging low-dimensional neural control with complex assistive tasks, contributing to the development of embodied telepresence systems based on supervisory control.
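To make concrete how two motor imagery commands can address more than two high-level actions through a finite state machine, here is a minimal sketch of a scroll-and-select menu. This is a common pattern for binary interfaces and is assumed here for illustration. The option names, the mapping of "left" to scrolling and "right" to confirming, and the `TwoCommandMenu` class are hypothetical and do not describe the authors' interface.

```python
from dataclasses import dataclass, field


@dataclass
class TwoCommandMenu:
    """Hypothetical FSM: 'left' scrolls through options, 'right' confirms."""
    options: tuple = ("navigate", "grasp", "release")
    cursor: int = 0
    issued: list = field(default_factory=list)

    def step(self, command):
        """Consume one decoded command; return an action name or None."""
        if command == "left":
            # Advance the highlighted option, wrapping around at the end.
            self.cursor = (self.cursor + 1) % len(self.options)
            return None
        if command == "right":
            # Confirm the highlighted option and return to the first entry.
            action = self.options[self.cursor]
            self.issued.append(action)
            self.cursor = 0
            return action
        raise ValueError(f"unknown command: {command!r}")


menu = TwoCommandMenu()
results = [menu.step(c) for c in ("left", "right", "right")]
```

In this sketch, the sequence left, right issues the second option, and a further right issues the first option, so any number of actions is reachable with only two input symbols at the cost of extra selection steps.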

Balint Varga - Karlsruhe Institute of Technology & Tamás Haidegger - Óbuda University

Teleoperation has been a central research topic for over six decades, with envisioned applications in power plants, extreme exploration, surgery, remote driving, industrial maintenance, and care robotics. Despite substantial technological progress, teleoperated systems remain far from ubiquitous deployment. This paper revisits the “teleoperation paradox” (the persistent gap between research maturity and real-world adoption) and investigates where the effective bottlenecks lie. Building on recent analyses of network capabilities and human factors, we argue that for most teleoperation applications, contemporary 5G communication infrastructures can meet key latency and bandwidth requirements under favorable conditions. At the same time, human factors, including situation awareness, cognitive load, expertise, and interface design, increasingly constrain performance, even when basic communication requirements are satisfied. This paper (i) summarizes teleoperation network requirements and compares them to 5G/6G capabilities, (ii) relates network evolution to human perceptual and cognitive limits, and (iii) structures operator limitations. Overall, our results motivate a shift in research emphasis toward human-centered design, operator training, advanced shared control algorithms, and safety-by-design solutions.

Teleoperation Bottlenecks: From Network-Centric Optimization to Human-Centered Design

Ziluo Ding

Dual-Level Humanoid Whole-Body Controller

This paper presents JAEGER, a dual-level whole-body controller for humanoid robots that addresses the challenges of training a more robust and versatile policy. Unlike traditional single-controller approaches, JAEGER separates the control of the upper and lower body into two independent controllers, so that each can focus on its distinct task. This separation alleviates the curse of dimensionality and improves fault tolerance. JAEGER supports both root velocity tracking (coarse-grained control) and local joint angle tracking (fine-grained control), enabling versatile and stable movements. To train the controller, we utilize a human motion dataset (AMASS), retargeting human poses to humanoid poses through an efficient retargeting network, and employ a curriculum learning approach. This method performs supervised learning for initialization, followed by reinforcement learning for further exploration. We conduct our experiments on two humanoid platforms and demonstrate the superiority of our approach against state-of-the-art methods in both simulation and real environments.
