Enhancing human mobility: From computer vision-based motion tracking to wearable assistive robot control
8:30 AM – 5:30 PM, Friday, May 23, 2025
Abstract
As wearable robotic devices for human movement assistance and rehabilitation transition from the lab to the real world, their ability to autonomously and seamlessly adapt to varying environmental conditions and user needs becomes crucial. Lower-limb exoskeletons and prostheses, for instance, must dynamically adjust their assistance profiles to accommodate different motor activities, such as level-ground walking or stair climbing. Achieving this requires not only recognizing user intentions but also gathering comprehensive information about the surroundings. Computer vision offers rich, direct, and interpretable data beyond what non-visual sensors such as encoders and inertial measurement units can provide, making it a promising tool for enhancing context awareness in wearable robots. However, integrating computer vision into wearable robot control poses several challenges, including ensuring that vision model outputs are available in real time, maintaining model robustness across diverse mobility contexts and dynamic user movements, and effectively fusing onboard sensor data with visual information. This workshop addresses these challenges by exploring the latest engineering solutions for computer vision-based human motion tracking and for control strategies of wearable robotic systems designed to augment human locomotion. By bridging the gap between researchers in wearable robotics and computer vision, as well as between academia and industry, we seek to provide a roadmap for developing robust, adaptable, and context-aware vision-based control frameworks that can be effectively translated from the lab to real-world applications.
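To make the sensor-fusion challenge above concrete, here is a minimal, purely illustrative control-loop sketch in Python: a stubbed vision-based terrain classifier and a stubbed IMU-based gait-phase estimator are combined to select and shape an assistance torque. All function names, terrain labels, profiles, and numbers are hypothetical placeholders, not any specific system presented at the workshop.

```python
# Illustrative context-aware assistance loop (hypothetical sketch, not a real controller).
# The vision and IMU components are stubs; a real system would use trained models and
# calibrated sensor drivers running within the controller's timing budget.
from dataclasses import dataclass
import random

TERRAINS = ("level_ground", "stairs_up", "stairs_down")

# Assistance torque profiles per terrain (made-up values for illustration only).
ASSIST_PROFILES = {
    "level_ground": {"peak_torque_nm": 10.0, "peak_timing_pct": 50.0},
    "stairs_up":    {"peak_torque_nm": 18.0, "peak_timing_pct": 40.0},
    "stairs_down":  {"peak_torque_nm": 6.0,  "peak_timing_pct": 60.0},
}

@dataclass
class ControlCommand:
    terrain: str
    gait_phase_pct: float
    torque_nm: float

def classify_terrain(rgb_frame) -> str:
    """Stub for a vision model that labels the upcoming terrain."""
    return random.choice(TERRAINS)

def estimate_gait_phase(imu_sample) -> float:
    """Stub for an onboard gait-phase estimator (0-100% of the stride)."""
    return random.uniform(0.0, 100.0)

def control_step(rgb_frame, imu_sample) -> ControlCommand:
    terrain = classify_terrain(rgb_frame)
    phase = estimate_gait_phase(imu_sample)
    profile = ASSIST_PROFILES[terrain]
    # Simple triangular torque profile peaking at the terrain-specific timing.
    distance = abs(phase - profile["peak_timing_pct"]) / 100.0
    torque = profile["peak_torque_nm"] * max(0.0, 1.0 - 2.0 * distance)
    return ControlCommand(terrain, phase, torque)

if __name__ == "__main__":
    print(control_step(rgb_frame=None, imu_sample=None))
```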
Workshop Abstract Submission (Deadline 4/14/25)
Submission CLOSED
We are pleased to invite 1-page extended abstract submissions for the Enhancing Human Mobility: From Computer Vision-Based Motion Tracking to Wearable Assistive Robot Control workshop at ICRA 2025. Submissions will be reviewed and selected for a short lightning talk and/or a poster session.
Abstract topics of interest include all aspects of 1) computer vision-based human motion tracking and/or 2) wearable robotic system control, including (but not limited to): human motion tracking/retargeting using computer vision, generating synthetic motion data, sensor fusion for user or environmental state estimation, robotic exoskeleton or prosthesis control, adaptive wearable robot design and control, and novel sensing methods for wearable robots.
Submission Format: Up to 2 pages, excluding acknowledgements, references, and appendix. Please use the standard IEEE conference format (i.e., two-column format).
Notification: Authors of accepted submissions will be notified by email
If you are already presenting work at ICRA 2025, you may upload your accepted manuscript.
There will be a monetary prize for the best presentation on each topic (Wearable Robots & Computer Vision)
Short Lightning Talk
A series of short talks given by junior researchers in the field (students or postdoctoral researchers)
5 minutes each, followed by a 1-minute Q&A
To encourage active discussion of relevant topics, we will also hold a poster session and a short networking session
Poster Session
A small symposium where junior and senior researchers interact by presenting their work as posters
There will be a poster session during each coffee break
Potential short talk presenters will be solicited through the same abstract submission
Industry live demos will be available at the workshop:
Wearable knee exoskeleton for locomotion
CV-based human pose estimation
Morning Sessions
8:30 Welcome and workshop overview
8:40 Seminar Talk: Lorenzo Masia, Technical University of Munich
9:00 Seminar Talk: Hanbyul Joo, Seoul National University
9:20 5 Short Lightning Talks (from abstract submissions)
10:00 Coffee Break and Networking Session
Live demos from the lightning talk presentations
11:00 Seminar Talk: Maurice Fallon, University of Oxford
11:20 Seminar Talk: Antonino Furnari, University of Catania
11:40 Industry Demo (Skip, wearable robotics)
12:00 Industry Demo (Meshcapade, computer vision)
12:20 Lunch and Networking Session
Afternoon Sessions
13:40 Seminar Talk: Kyu-Jin Cho, Seoul National University
14:00 Seminar Talk: Muhammed Kocabas, Meshcapade
14:20 Seminar Talk: István Sárándi, University of Tübingen
14:40 Seminar Talk: Yuting Ye, Meta
15:00 Coffee Break and Networking Session
Quest 3 motion matching algorithm demo by Yuting Ye
16:00 Panel Discussion (practicalities and future directions of CV-based wearable robot control)
17:00 Concluding Remarks (best presentation awards)
In humans, vision plays a critical role in adapting motor strategies to changing environmental contexts. Translating this principle to wearable robotics, this talk explores how computer vision, particularly real-time scene understanding and object recognition, can be leveraged to enable context-aware control in soft wearable assistive robots. These systems are designed to move beyond reactive or fixed-pattern assistance by dynamically modulating support based on the user's surroundings, movement intent, and task demands. At the core of this approach is a shared control framework where wearable robots interpret motion data from the user merged with artificial vision information to cooperatively deliver tailored assistance in response to terrain changes or object interactions. The talk will present recent advances in both upper and lower limb exosuits, demonstrating how physics-informed and machine learning-based vision systems enable lightweight, efficient, and embedded control solutions. Experimental results highlight significant improvements in metabolic efficiency, muscle activity reduction, and task adaptability during real-world locomotion and object manipulation tasks. This work points toward a new generation of intelligent wearable robots designed to enhance human performance in daily life, for both wellness and industrial applications.
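As a rough, hypothetical illustration of the shared-control idea described above (not Prof. Masia's actual controller), the snippet below blends a user-driven assistance estimate with a vision-suggested one, weighted by the vision model's confidence; all names and numbers are made up for the example.

```python
# Illustrative shared-control blend (hypothetical sketch): the commanded assistance
# mixes a user-driven term with a vision-driven context term, weighted by the
# vision model's confidence so that uncertain perception defers to the user.
def blend_assistance(tau_user: float, tau_vision: float, vision_conf: float) -> float:
    """Blend user-intent torque with vision-suggested torque (Nm).

    vision_conf in [0, 1]; low confidence falls back toward user-driven assistance.
    """
    alpha = max(0.0, min(1.0, vision_conf))
    return (1.0 - alpha) * tau_user + alpha * tau_vision

# Example: an uncertain scene interpretation (confidence 0.3) keeps the command
# close to the user-driven estimate.
print(blend_assistance(tau_user=8.0, tau_vision=15.0, vision_conf=0.3))  # -> 10.1
```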
Self-balancing exoskeletons are a key enabling technology for individuals with mobility impairments. While current challenges focus on human-compliant hardware and control, unlocking their use for daily activities requires scene perception. In this talk I will present Exosense, a vision-centric scene understanding system for self-balancing exoskeletons. We introduce a multi-sensor visual-inertial mapping device as well as a navigation stack for state estimation, terrain mapping, and long-term operation. We tested Exosense, attached to both a human leg and Wandercraft's Personal Exoskeleton, in real-world indoor scenarios. Experimental results will demonstrate Exosense reconstructing terrain as the exoskeleton explores a building, as well as enabling visual localisation in a previously built map.
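For readers unfamiliar with terrain mapping, the toy sketch below shows the basic idea of accumulating depth points into a 2.5D elevation grid. It is a generic illustration under simplifying assumptions (points already expressed in the map frame, max height kept per cell), not the Exosense mapping stack.

```python
# Toy 2.5D elevation-map update (illustration only, not the Exosense navigation stack).
import numpy as np

def update_elevation_map(grid: np.ndarray, points_xyz: np.ndarray,
                         resolution: float = 0.05, origin=(0.0, 0.0)) -> np.ndarray:
    """grid: (H, W) elevation map in metres; points_xyz: (N, 3) points in the map frame."""
    ix = ((points_xyz[:, 0] - origin[0]) / resolution).astype(int)
    iy = ((points_xyz[:, 1] - origin[1]) / resolution).astype(int)
    valid = (ix >= 0) & (ix < grid.shape[1]) & (iy >= 0) & (iy < grid.shape[0])
    for x, y, z in zip(ix[valid], iy[valid], points_xyz[valid, 2]):
        grid[y, x] = max(grid[y, x], z)   # keep the highest observed height per cell
    return grid

grid = np.full((40, 40), -np.inf)          # 2 m x 2 m map at 5 cm resolution
points = np.array([[0.52, 0.48, 0.00],     # floor point
                   [1.03, 0.47, 0.17]])    # point on a step edge
grid = update_elevation_map(grid, points)
```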
Individuals with motor impairments need soft wearable robotic hands to assist with daily tasks, and these systems typically operate by interpreting user intentions. This talk explores a new paradigm in intention detection for wearable hand robots, leveraging vision-based approaches to bridge the gap between human intent and robotic assistance. We first introduce VIDEO-Net, a deep learning model that predicts user intention by analyzing spatial and temporal visual cues from a first-person-view camera, enabling soft robotic hands to respond seamlessly without requiring calibration or additional user input. Expanding on this foundation, our second study extends intention detection to bimanual tasks, allowing individuals with spinal cord injuries to perform complex activities of daily living, such as preparing a meal, with both hands. We hypothesize that the intentions underlying bimanual hand movements can be effectively inferred by classifying a set of fundamental low-level actions, such as grasping, pressing, tightening, squeezing, and releasing. The third study targets cluttered environments common in occupational therapy, such as the peg-and-hole and box-and-block tests. We present a vision-based system that uses kinematic features to detect not only grasping intentions but also intermediate pregrasp postures, which are crucial for stable and precise object manipulation in stroke rehabilitation. Our findings demonstrate its effectiveness in structured therapy scenarios and highlight its applicability to goal-oriented pick-and-place tasks relevant to occupational therapy in daily living.
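The snippet below is a generic, hypothetical spatio-temporal classifier over a short egocentric clip, using the five low-level actions named in the abstract as the label set. It illustrates the classification setup only and is not the VIDEO-Net architecture.

```python
# Generic clip-level action classifier sketch (PyTorch); not the VIDEO-Net model.
import torch
import torch.nn as nn

ACTIONS = ["grasp", "press", "tighten", "squeeze", "release"]

class ClipActionClassifier(nn.Module):
    def __init__(self, num_actions: int = len(ACTIONS)):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # (B, 3, T, H, W) -> (B, 16, T, H, W)
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                      # global spatio-temporal pooling
        )
        self.head = nn.Linear(16, num_actions)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(clip).flatten(1)
        return self.head(feats)

model = ClipActionClassifier().eval()
clip = torch.randn(1, 3, 8, 112, 112)                    # one 8-frame egocentric clip
with torch.no_grad():
    pred = ACTIONS[model(clip).argmax(dim=1).item()]
print("predicted low-level action:", pred)
```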
Assistant Professor
Department of Computer Science and Engineering
Seoul National University
Towards Capturing Everyday Movements to Scale Up and Enrich Human Motion Data
Equipping AI and robotic systems with the ability to understand human behavior in everyday life is essential for enabling them to better assist people across a wide range of applications. However, the availability of high-quality human motion data for learning such knowledge remains extremely limited. In this talk, I will present our lab’s efforts to scale and enrich 3D human motion datasets by capturing everyday human movements and natural human-object interactions. I will first introduce ParaHome, our new multi-camera system designed to capture human-object interactions in natural home environments. Next, I will present MocapEvery, a lightweight and cost-effective motion capture solution that uses two smartwatches and a head-mounted camera to enable full-body 3D motion capture across diverse settings. Finally, I will discuss our recent work that enables machines to model comprehensive affordances for 3D objects by leveraging pre-trained 2D diffusion models, allowing for unbounded object interaction capabilities.
Sensing your moves from smart wearable devices
Consumer wearable devices such as VR headsets, smart glasses, watches, and even AirPods are now packed with sensors and onboard compute, unlocking new possibilities for egocentric full-body motion tracking, synthesis, and understanding. In this talk, I will first present results in real-time full-body motion reconstruction from VR devices, comparing physics-based simulation with data-driven motion synthesis approaches. Next, I will share preliminary findings using only the Aria smart glasses (https://www.projectaria.com/glasses/), highlighting the importance of environmental context in motion understanding under limited sensing. Key challenges, including minimal sensing, real-time constraints, integration of multi-modal data, and dataset availability, will be discussed, along with opportunities for future research and impactful applications. Together, these advances are ushering in a new era of context-aware physical intelligence, where always-on wearables seamlessly integrate into daily life to deliver truly personalized and immersive user experiences.
In this talk I will give a technical deep dive into Mocapade, Meshcapade's markerless motion capture solution that works from any video. Mocapade is based on our CVPR 2025 paper, PromptHMR: Promptable Human Mesh Recovery. PromptHMR reimagines human pose and shape estimation through the integration of spatial and semantic prompts. Unlike existing methods that either sacrifice scene context when working with cropped images or lose accuracy when processing full scenes, PromptHMR achieves both contextual awareness and high precision. I will demonstrate how our system leverages multiple input modalities, from facial bounding boxes in crowded scenes to natural language descriptions of body shape, to achieve state-of-the-art performance across challenging scenarios.
Assistant Professor
Department of Mathematics and Computer Science
University of Catania
Egocentric Vision: An Anticipatory Sensor for Wearable Robots - From Action Forecasting to Procedural Understanding
Egocentric vision puts a head-mounted camera where a wearable robot most needs eyes: next to the person it assists. From this first-person view, we can predict which object will be touched next, trace the steps of an unfolding task, and detect when an action drifts off script. I will present our lab's progress toward these abilities, covering next-active-object prediction, task-graph learning, and gaze-guided error detection. Finally, I'll discuss how egocentric vision needs to move towards online, streaming, and ultimately real-time processing to support real-world applications in wearable assistive systems and beyond.
Reconstructing 3D humans from single-image observations is critical for many robotics applications. The recent rise of larger datasets and foundation models raises the question of how to best leverage these resources to achieve new heights of reconstruction quality. I will present two of our group's recent works that address this challenge at two levels of abstraction. In the first work, Neural Localizer Fields (NLF), we formulate body pose and shape estimation pointwise, in correspondence with a continuous canonical template. This allows us to merge 50 heterogeneous large-scale datasets that use different human representations, and to train a state-of-the-art generalist model that can trace out the posed body shape point by point in any landmark or mesh format. Next, in Human-3Diffusion, we tackle template-free 3D clothed avatar reconstruction by combining the realistic appearance priors of 2D image diffusion models with the explicit 3D consistency of Gaussian splatting. Finally, I will discuss the pros and cons of these different levels of image-based 3D human reconstruction, including remaining open challenges.
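To illustrate the pointwise formulation in spirit only (this is an invented toy interface, not the NLF code or its API), the sketch below maps an image feature vector plus arbitrary canonical-space query points to predicted 3D locations, so any landmark set or mesh vertex list can be queried with the same model.

```python
# Toy pointwise localizer (hypothetical sketch of the "query any canonical point" idea).
import torch
import torch.nn as nn

class PointwiseLocalizer(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 128), nn.ReLU(),
            nn.Linear(128, 3),            # predicted 3D position per query point
        )

    def forward(self, image_feat: torch.Tensor, canon_points: torch.Tensor) -> torch.Tensor:
        # image_feat: (feat_dim,)   canon_points: (P, 3) points on the canonical body
        feats = image_feat.expand(canon_points.shape[0], -1)
        return self.mlp(torch.cat([feats, canon_points], dim=-1))

model = PointwiseLocalizer()
image_feat = torch.randn(256)             # stand-in for an image encoder output
queries = torch.rand(10, 3)               # any landmark set or mesh vertices
posed_xyz = model(image_feat, queries)    # (10, 3) predicted locations
```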