Enhancing human mobility: From computer vision-based motion tracking to wearable assistive robot control
8:30 AM – 5:30 PM, Friday, May 23, 2025
Abstract
As wearable robotic devices for human movement assistance and rehabilitation transition from the lab to the real world, their ability to autonomously and seamlessly adapt to varying environmental conditions and user needs becomes crucial. Lower-limb exoskeletons and prostheses, for instance, must dynamically adjust their assistance profiles to accommodate different motor activities, such as level-ground walking or stair climbing. Achieving this requires not only recognizing user intentions but also gathering comprehensive information about the surroundings. Computer vision offers rich, direct, and interpretable data beyond what non-visual sensors such as encoders and inertial measurement units can provide, making it a promising tool for enhancing context awareness in wearable robots. However, integrating computer vision into wearable robot control poses several challenges, including ensuring that vision model outputs are available in real time, maintaining model robustness across diverse mobility contexts and dynamic user movements, and effectively fusing onboard sensor data with visual information. This workshop addresses these challenges by exploring the latest engineering solutions for computer vision-based human motion tracking and for control strategies of wearable robotic systems designed to augment human locomotion. By bridging the gap between researchers in wearable robotics and computer vision, as well as between academia and industry, we seek to provide a roadmap for developing robust, adaptable, and context-aware vision-based control frameworks that can be effectively translated from the lab to real-world applications.
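To make the sensor-fusion challenge above concrete, here is a minimal, purely illustrative control-loop sketch in Python: a stubbed vision-based terrain classifier and a stubbed IMU-based gait-phase estimator are combined to select and shape an assistance torque. All function names, terrain labels, profiles, and numbers are hypothetical placeholders, not any specific system presented at the workshop.

```python
# Illustrative context-aware assistance loop (hypothetical sketch, not a real controller).
# The vision and IMU components are stubs; a real system would use trained models and
# calibrated sensor drivers running within the controller's timing budget.
from dataclasses import dataclass
import random

TERRAINS = ("level_ground", "stairs_up", "stairs_down")

# Assistance torque profiles per terrain (made-up values for illustration only).
ASSIST_PROFILES = {
    "level_ground": {"peak_torque_nm": 10.0, "peak_timing_pct": 50.0},
    "stairs_up":    {"peak_torque_nm": 18.0, "peak_timing_pct": 40.0},
    "stairs_down":  {"peak_torque_nm": 6.0,  "peak_timing_pct": 60.0},
}

@dataclass
class ControlCommand:
    terrain: str
    gait_phase_pct: float
    torque_nm: float

def classify_terrain(rgb_frame) -> str:
    """Stub for a vision model that labels the upcoming terrain."""
    return random.choice(TERRAINS)

def estimate_gait_phase(imu_sample) -> float:
    """Stub for an onboard gait-phase estimator (0-100% of the stride)."""
    return random.uniform(0.0, 100.0)

def control_step(rgb_frame, imu_sample) -> ControlCommand:
    terrain = classify_terrain(rgb_frame)
    phase = estimate_gait_phase(imu_sample)
    profile = ASSIST_PROFILES[terrain]
    # Simple triangular torque profile peaking at the terrain-specific timing.
    distance = abs(phase - profile["peak_timing_pct"]) / 100.0
    torque = profile["peak_torque_nm"] * max(0.0, 1.0 - 2.0 * distance)
    return ControlCommand(terrain, phase, torque)

if __name__ == "__main__":
    print(control_step(rgb_frame=None, imu_sample=None))
```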
Workshop Abstract Submission (Deadline 4/14/25)
Submission CLOSED
We are pleased to invite 1-page extended abstract submissions for the Enhancing Human Mobility: From Computer Vision-Based Motion Tracking to Wearable Assistive Robot Control workshop at ICRA 2025. Submissions will be reviewed and selected for a short lightning talk and/or a poster session.
Abstract topics of interest include all aspects of 1) computer vision-based human motion tracking and/or 2) wearable robotic system control, including (but not limited to): human motion tracking/retargeting using computer vision, generating synthetic motion data, sensor fusion for user or environmental state estimation, robotic exoskeleton or prosthesis control, adaptive wearable robot design and control, and novel sensing methods for wearable robots.
Submission Format: Up to 2 pages, excluding acknowledgements, references, and appendix. Please use the standard IEEE conference format (i.e., two-column format).
Notification: Authors of accepted submissions will be notified by email
If you are already presenting work at ICRA 2025, you may upload your accepted manuscript.
There will be a monetary prize for the best presentation on each topic (Wearable Robots & Computer Vision)
Short Lightning Talk
A series of short talks given by junior researchers in the field (students or postdoctoral researchers)
5 minutes each, followed by a 1-minute Q&A
To encourage active discussion of relevant topics, we will also hold a poster session and a short networking session
Poster Session
A small symposium where junior and senior researchers interact by presenting their work as posters
There will be a poster session during each coffee break
Potential short talk presenters will be solicited through the same abstract submission
Industry live demos will be available at the workshop:
Wearable knee exoskeleton for locomotion
CV-based human pose estimation
Morning Sessions
8:30 Welcome and workshop overview
8:40 Seminar Talk: Lorenzo Masia, Technical University of Munich
9:00 Seminar Talk: Hanbyul Joo, Seoul National University
9:20 5 Short Lightning Talks (from abstract submissions)
10:00 Coffee Break and Networking Session
Live demos from the lightning talk presentations
11:00 Seminar Talk: Maurice Fallon, University of Oxford
11:20 Seminar Talk: Antonino Furnari, University of Catania
11:40 Industry Demo (Skip, wearable robotics)
12:00 Industry Demo (Meshcapade, computer vision)
12:20 Lunch and Networking Session
Afternoon Sessions
13:40 Seminar Talk: Kyu-Jin Cho, Seoul National University
14:00 Seminar Talk: Muhammed Kocabas, Meshcapade
14:20 Seminar Talk: István Sárándi, University of Tübingen
14:40 Seminar Talk: Yuting Ye, Meta
15:00 Coffee Break and Networking Session
Quest 3 motion matching algorithm demo by Yuting Ye
16:00 Panel Discussion (practicalities and future directions of CV-based wearable robot control)
17:00 Concluding Remarks (best presentation awards)
In humans, vision plays a critical role in adapting motor strategies to changing environmental contexts. Translating this principle to wearable robotics, this talk explores how computer vision, particularly real-time scene understanding and object recognition, can be leveraged to enable context-aware control in soft wearable assistive robots. These systems are designed to move beyond reactive or fixed-pattern assistance by dynamically modulating support based on the user's surroundings, movement intent, and task demands. At the core of this approach is a shared control framework where wearable robots interpret motion data from the user merged with artificial vision information to cooperatively deliver tailored assistance in response to terrain changes or object interactions. The talk will present recent advances in both upper and lower limb exosuits, demonstrating how physics-informed and machine learning-based vision systems enable lightweight, efficient, and embedded control solutions. Experimental results highlight significant improvements in metabolic efficiency, muscle activity reduction, and task adaptability during real-world locomotion and object manipulation tasks. This work points toward a new generation of intelligent wearable robots designed to enhance human performance in daily life, for both wellness and industrial applications.
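As a rough, hypothetical illustration of the shared-control idea described above (not Prof. Masia's actual controller), the snippet below blends a user-driven assistance estimate with a vision-suggested one, weighted by the vision model's confidence; all names and numbers are made up for the example.

```python
# Illustrative shared-control blend (hypothetical sketch): the commanded assistance
# mixes a user-driven term with a vision-driven context term, weighted by the
# vision model's confidence so that uncertain perception defers to the user.
def blend_assistance(tau_user: float, tau_vision: float, vision_conf: float) -> float:
    """Blend user-intent torque with vision-suggested torque (Nm).

    vision_conf in [0, 1]; low confidence falls back toward user-driven assistance.
    """
    alpha = max(0.0, min(1.0, vision_conf))
    return (1.0 - alpha) * tau_user + alpha * tau_vision

# Example: an uncertain scene interpretation (confidence 0.3) keeps the command
# close to the user-driven estimate.
print(blend_assistance(tau_user=8.0, tau_vision=15.0, vision_conf=0.3))  # -> 10.1
```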
Self-balancing exoskeletons are a key enabling technology for individuals with mobility impairments. While current challenges focus on human-compliant hardware and control, unlocking their use for daily activities requires scene perception. In this talk I will present Exosense, a vision-centric scene understanding system for self-balancing exoskeletons. We introduce a multi-sensor visual-inertial mapping device as well as a navigation stack for state estimation, terrain mapping, and long-term operation. We tested Exosense, attached to both a human leg and Wandercraft's Personal Exoskeleton, in real-world indoor scenarios. Experimental results will demonstrate Exosense reconstructing terrain as the exoskeleton explores a building, as well as enabling visual localisation in a previously built map.
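For readers unfamiliar with terrain mapping, the toy sketch below shows the basic idea of accumulating depth points into a 2.5D elevation grid. It is a generic illustration under simplifying assumptions (points already expressed in the map frame, max height kept per cell), not the Exosense mapping stack.

```python
# Toy 2.5D elevation-map update (illustration only, not the Exosense navigation stack).
import numpy as np

def update_elevation_map(grid: np.ndarray, points_xyz: np.ndarray,
                         resolution: float = 0.05, origin=(0.0, 0.0)) -> np.ndarray:
    """grid: (H, W) elevation map in metres; points_xyz: (N, 3) points in the map frame."""
    ix = ((points_xyz[:, 0] - origin[0]) / resolution).astype(int)
    iy = ((points_xyz[:, 1] - origin[1]) / resolution).astype(int)
    valid = (ix >= 0) & (ix < grid.shape[1]) & (iy >= 0) & (iy < grid.shape[0])
    for x, y, z in zip(ix[valid], iy[valid], points_xyz[valid, 2]):
        grid[y, x] = max(grid[y, x], z)   # keep the highest observed height per cell
    return grid

grid = np.full((40, 40), -np.inf)          # 2 m x 2 m map at 5 cm resolution
points = np.array([[0.52, 0.48, 0.00],     # floor point
                   [1.03, 0.47, 0.17]])    # point on a step edge
grid = update_elevation_map(grid, points)
```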
Individuals with motor impairments need soft wearable robotic hands to assist with daily tasks, and these systems typically operate by interpreting user intentions. This talk explores a new paradigm in intention detection for wearable hand robots, leveraging vision-based approaches to bridge the gap between human intent and robotic assistance. We first introduce VIDEO-Net, a deep learning model that predicts user intention by analyzing spatial and temporal visual cues from a first-person-view camera, enabling soft robotic hands to respond seamlessly without requiring calibration or additional user input. Expanding on this foundation, our second study extends intention detection to bimanual tasks, allowing individuals with spinal cord injuries to perform complex activities of daily living, such as preparing a meal, with both hands. We hypothesize that the intentions underlying bimanual hand movements can be effectively inferred by classifying a set of fundamental low-level actions, such as grasping, pressing, tightening, squeezing, and releasing. The third study targets cluttered environments common in occupational therapy, such as the peg-and-hole and box-and-block tests. We present a vision-based system that uses kinematic features to detect not only grasping intentions but also intermediate pregrasp postures, which are crucial for stable and precise object manipulation in stroke rehabilitation. Our findings demonstrate its effectiveness in structured therapy scenarios and highlight its applicability to goal-oriented pick-and-place tasks relevant to occupational therapy in daily living.
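The snippet below is a generic, hypothetical spatio-temporal classifier over a short egocentric clip, using the five low-level actions named in the abstract as the label set. It illustrates the classification setup only and is not the VIDEO-Net architecture.

```python
# Generic clip-level action classifier sketch (PyTorch); not the VIDEO-Net model.
import torch
import torch.nn as nn

ACTIONS = ["grasp", "press", "tighten", "squeeze", "release"]

class ClipActionClassifier(nn.Module):
    def __init__(self, num_actions: int = len(ACTIONS)):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # (B, 3, T, H, W) -> (B, 16, T, H, W)
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                      # global spatio-temporal pooling
        )
        self.head = nn.Linear(16, num_actions)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(clip).flatten(1)
        return self.head(feats)

model = ClipActionClassifier().eval()
clip = torch.randn(1, 3, 8, 112, 112)                    # one 8-frame egocentric clip
with torch.no_grad():
    pred = ACTIONS[model(clip).argmax(dim=1).item()]
print("predicted low-level action:", pred)
```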
Assistant Professor
Department of Computer Science and Engineering
Seoul National University
Towards Capturing Everyday Movements to Scale Up and Enrich Human Motion Data
Equipping AI and robotic systems with the ability to understand human behavior in everyday life is essential for enabling them to better assist people across a wide range of applications. However, the availability of high-quality human motion data for learning such knowledge remains extremely limited. In this talk, I will present our lab’s efforts to scale and enrich 3D human motion datasets by capturing everyday human movements and natural human-object interactions. I will first introduce ParaHome, our new multi-camera system designed to capture human-object interactions in natural home environments. Next, I will present MocapEvery, a lightweight and cost-effective motion capture solution that uses two smartwatches and a head-mounted camera to enable full-body 3D motion capture across diverse settings. Finally, I will discuss our recent work that enables machines to model comprehensive affordances for 3D objects by leveraging pre-trained 2D diffusion models, allowing for unbounded object interaction capabilities.
Sensing your moves from smart wearable devices
Consumer wearable devices such as VR headsets, smart glasses, watches, and even AirPods are now packed with sensors and onboard compute, unlocking new possibilities for egocentric full-body motion tracking, synthesis, and understanding. In this talk, I will first present results in real-time full-body motion reconstruction from VR devices, comparing physics-based simulation with data-driven motion synthesis approaches. Next, I will share preliminary findings using only the Aria smart glasses (https://www.projectaria.com/glasses/), highlighting the importance of environmental context in motion understanding under limited sensing. Key challenges, including minimal sensing, real-time constraints, integration of multi-modal data, and dataset availability, will be discussed, along with opportunities for future research and impactful applications. Together, these advances are ushering in a new era of context-aware physical intelligence, where always-on wearables seamlessly integrate into daily life to deliver truly personalized and immersive user experiences.
In this talk I will give a technical deep dive into Mocapade, Meshcapade's markerless motion capture solution that works from any video. Mocapade is based on our CVPR 2025 paper, PromptHMR: Promptable Human Mesh Recovery. PromptHMR reimagines human pose and shape estimation through the integration of spatial and semantic prompts. Unlike existing methods that either sacrifice scene context when working with cropped images or lose accuracy when processing full scenes, PromptHMR achieves both contextual awareness and high precision. I will demonstrate how our system leverages multiple input modalities, from facial bounding boxes in crowded scenes to natural language descriptions of body shape, to achieve state-of-the-art performance across challenging scenarios.
Assistant Professor
Department of Mathematics and Computer Science
University of Catania
Egocentric Vision: An Anticipatory Sensor for Wearable Robots - From Action Forecasting to Procedural Understanding
Egocentric vision puts a head-mounted camera where a wearable robot most needs eyes: next to the person it assists. From this first-person view, we can predict which object will be touched next, trace the steps of an unfolding task, and detect when an action drifts off script. I will present our lab's progress toward these abilities, covering next-active-object prediction, task-graph learning, and gaze-guided error detection. Finally, I'll discuss how egocentric vision needs to move towards online, streaming, and ultimately real-time processing to support real-world applications in wearable assistive systems and beyond.
Reconstructing 3D humans from single-image observations is critical for many robotics applications. The recent rise of larger datasets and foundation models raises the question of how to best leverage these resources to achieve new heights of reconstruction quality. I will present two of our group's recent works that address this challenge at two levels of abstraction. In the first work, Neural Localizer Fields (NLF), we formulate body pose and shape estimation pointwise, in correspondence with a continuous canonical template. This allows us to merge 50 heterogeneous large-scale datasets that use different human representations, and to train a state-of-the-art generalist model that can trace out the posed body shape point by point in any landmark or mesh format. Next, in Human-3Diffusion, we tackle template-free 3D clothed avatar reconstruction by combining the realistic appearance priors of 2D image diffusion models with the explicit 3D consistency of Gaussian splatting. Finally, I will discuss the pros and cons of these different levels of image-based 3D human reconstruction, including remaining open challenges.
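To illustrate the pointwise formulation in spirit only (this is an invented toy interface, not the NLF code or its API), the sketch below maps an image feature vector plus arbitrary canonical-space query points to predicted 3D locations, so any landmark set or mesh vertex list can be queried with the same model.

```python
# Toy pointwise localizer (hypothetical sketch of the "query any canonical point" idea).
import torch
import torch.nn as nn

class PointwiseLocalizer(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 128), nn.ReLU(),
            nn.Linear(128, 3),            # predicted 3D position per query point
        )

    def forward(self, image_feat: torch.Tensor, canon_points: torch.Tensor) -> torch.Tensor:
        # image_feat: (feat_dim,)   canon_points: (P, 3) points on the canonical body
        feats = image_feat.expand(canon_points.shape[0], -1)
        return self.mlp(torch.cat([feats, canon_points], dim=-1))

model = PointwiseLocalizer()
image_feat = torch.randn(256)             # stand-in for an image encoder output
queries = torch.rand(10, 3)               # any landmark set or mesh vertices
posed_xyz = model(image_feat, queries)    # (10, 3) predicted locations
```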