Waseda University
(*equal contribution)
As the global population continues to age, a shortage of caregivers is expected in the near future. Dressing assistance plays a vital role in supporting daily living and social participation.
However, assisting with close-fitting garments such as socks remains particularly challenging for robots, as it requires delicate force control to manage friction and snagging against the skin, while accurately considering the shape and position of both the garment and the human body.
To address these challenges, we propose a multimodal sock-dressing assistance method that integrates visual, proprioceptive, and tactile information with semantic-based visual attention. The proposed approach enables adaptive and safe interaction across individual differences and unseen environments.
Experiments using the bimanual humanoid robot Dry-AIREC demonstrate successful sock dressing on human participants, highlighting the effectiveness of the proposed model for close-contact dressing assistance.
Objective and Approach
The objective of this work is to enable robust motion generation for robot-assisted sock dressing that can adapt to individual differences in human feet and operate reliably in unseen environments.
Dressing close-fitting garments such as socks is particularly challenging due to friction, snagging, and the need for precise force control while maintaining safety for the human user. These difficulties are further amplified by variations in foot size, shape, flexibility, and appearance, as well as changes in background and environmental conditions.
Our goal is to develop a motion generation framework that estimates the state of both the foot and the garment and generates appropriate dressing motions that generalize beyond the training conditions.
To achieve this goal, we propose a multimodal imitation learning framework that integrates visual, force, and tactile information.
Instead of relying solely on RGB images, we incorporate semantic segmentation to extract object-level information of the foot and the sock, together with monocular depth estimation to infer their three-dimensional spatial relationship. This semantic and depth-aware perception enables robust state estimation that is less sensitive to variations in appearance and background.
The extracted visual attention points, along with joint angles, joint torques, and tactile feedback from the robot’s fingers, are processed by a hierarchical LSTM-based predictive model. This model captures temporal dynamics and interaction forces during dressing, allowing the robot to generate adaptive motions with appropriate force direction and magnitude.
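For illustration, a minimal sketch of such a hierarchical multimodal predictor is shown below (the structure, module names, and dimensions are assumptions for this sketch, not the authors' exact implementation): modality-specific LSTM cells encode the attention points, joint/torque signals, and tactile signals, and a higher-level LSTM integrates them to predict the next-step attention points and joint angles.

# Minimal sketch of a hierarchical multimodal predictive model (assumed structure).
# Each modality has its own low-level LSTM; a higher-level LSTM integrates them
# and predicts next-step targets.
import torch
import torch.nn as nn

class HierarchicalPredictor(nn.Module):
    def __init__(self, dim_attn=8, dim_joint=14, dim_tactile=6, hidden=64, union=128):
        super().__init__()
        # Low-level (modality-specific) recurrent encoders
        self.rnn_vision = nn.LSTMCell(dim_attn, hidden)       # visual attention points
        self.rnn_motor = nn.LSTMCell(dim_joint * 2, hidden)   # joint angles + torques
        self.rnn_tactile = nn.LSTMCell(dim_tactile, hidden)   # fingertip tactile signals
        # High-level LSTM integrating all modalities
        self.rnn_union = nn.LSTMCell(hidden * 3, union)
        # Next-step predictions
        self.head_attn = nn.Linear(union, dim_attn)
        self.head_joint = nn.Linear(union, dim_joint)

    def forward(self, attn, joint, torque, tactile, states):
        sv, sm, st, su = states
        sv = self.rnn_vision(attn, sv)
        sm = self.rnn_motor(torch.cat([joint, torque], dim=-1), sm)
        st = self.rnn_tactile(tactile, st)
        su = self.rnn_union(torch.cat([sv[0], sm[0], st[0]], dim=-1), su)
        # At t = 0, states can be (None, None, None, None); each LSTMCell then starts from zeros.
        return self.head_attn(su[0]), self.head_joint(su[0]), (sv, sm, st, su)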
For safety and data efficiency, the model is trained using demonstrations collected on a mannequin and then applied to real human subjects, demonstrating strong generalization to unseen foot sizes and environments.
Collecting Training Data with Teleoperation
Robot motions were taught via teleoperation with a PS5 controller: soft impedance control was applied to joints that might contact the human, while joints requiring force were position-controlled for precise execution.
To improve generalization in teleoperation, we employed two distinct movement patterns.
First, synchronous arm movements were used to ensure high reproducibility across trials.
Second, alternating arm movements were introduced to capture dynamic responses arising from friction between the sock and the skin.
These complementary movement patterns enable a more comprehensive evaluation of teleoperation performance under varying interaction conditions.
The demonstration angles were varied among 30°, 40°, and 50° to teach diverse trajectories and improve robustness to positional differences.
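As a rough illustration of this setup (the joint names, grouping, and parameters below are hypothetical and do not reflect the robot's actual interface), the demonstration configuration could be summarized as:

# Hypothetical sketch of the per-joint control configuration used during
# teleoperated demonstrations. Joints that may contact the human use soft
# impedance; joints that must transmit force are position-controlled.
DEMO_ANGLES_DEG = [30, 40, 50]            # demonstration angles varied across trials
MOVEMENT_PATTERNS = ["synchronous_arms",  # high reproducibility across trials
                     "alternating_arms"]  # captures friction-driven dynamic responses

JOINT_CONTROL = {
    "right_wrist": {"mode": "impedance", "stiffness": "soft"},  # may touch the leg
    "left_wrist":  {"mode": "impedance", "stiffness": "soft"},
    "right_elbow": {"mode": "position"},                        # needs force to stretch the sock
    "left_elbow":  {"mode": "position"},
}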
Proposed Model for Estimating Foot and Garment States Using Semantic Visual Attention
Motion generation is based on EIPL with a hierarchical LSTM. Semantic information and 3D object states are extracted from images, and CNN features from semantic masks and depth are processed by Spatial Softmax to obtain visual attention points. These attention points, together with joint angles, torques, and tactile data, are fed into a hierarchical LSTM to predict the next-step image and joint angles.
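The Spatial Softmax step can be sketched generically as follows (a minimal implementation of the mechanism, not the paper's exact code): each CNN feature channel computed from the semantic masks and depth map is converted into an expected 2D coordinate, which serves as a visual attention point.

# Generic Spatial Softmax: converts each CNN feature channel into a 2D expected
# keypoint (attention point). A sketch of the mechanism only.
import torch
import torch.nn.functional as F

def spatial_softmax(features):
    """features: (B, C, H, W) CNN features from semantic-mask / depth encoders.
    Returns (B, C, 2) attention points in normalized image coordinates [-1, 1]."""
    b, c, h, w = features.shape
    probs = F.softmax(features.view(b, c, -1), dim=-1).view(b, c, h, w)
    ys = torch.linspace(-1.0, 1.0, h, device=features.device)
    xs = torch.linspace(-1.0, 1.0, w, device=features.device)
    # Expected x and y coordinates under each channel's spatial distribution
    exp_x = (probs.sum(dim=2) * xs).sum(dim=-1)   # (B, C)
    exp_y = (probs.sum(dim=3) * ys).sum(dim=-1)   # (B, C)
    return torch.stack([exp_x, exp_y], dim=-1)    # (B, C, 2)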
Semantic-based Visual Attention with Semantic–Depth Integration
This figure illustrates the semantic-based visual attention mechanism during sock-dressing.
Semantic masks identify task-relevant regions—the foot and the sock—while suppressing background information, and depth is estimated from the same visual input. These are integrated to form object-centric 3D representations, enabling attention to consider both object location and spatial relationships in depth.
The attention points move dynamically over the object area as the task progresses. Blue points indicate current visual attention key points, while red points represent predicted attention points for the next time step.
By predicting future attention locations using semantic and depth-aware features, the model captures the temporal evolution of object interaction, enabling stable motion generation under occlusion and background changes.
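To make the semantic-depth integration concrete, the sketch below back-projects a 2D attention point into a 3D point in the camera frame, assuming pinhole intrinsics and metric depth sampled from the estimated depth map (an illustrative assumption; the paper's exact 3D representation may differ).

# Sketch: back-project a 2D attention point into 3D using a monocular depth map.
# fx, fy, cx, cy are assumed pinhole camera intrinsics; depth_map is assumed to
# hold metric depth (e.g., from a model such as Depth Anything after scaling).
import numpy as np

def attention_point_to_3d(u, v, depth_map, fx, fy, cx, cy):
    """u, v: pixel coordinates of an attention point; depth_map: (H, W) array."""
    z = float(depth_map[int(round(v)), int(round(u))])  # depth at the attention point
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])  # 3D position in the camera frame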
Result and Discussion
The ablation study clearly demonstrates the importance of each component in the proposed model.
The full model achieved a 100% success rate in the sock-dressing task. When individual components were removed, performance degraded in characteristic ways.
Removing the Depth Anything Model (DAM) reduced the success rate to 85%, indicating that depth information plays a key role in understanding the spatial relationship between the sock and the foot. Failure cases showed that, without depth cues, the robot occasionally misjudged vertical alignment, causing the sock to catch on the toes during downward motion.
Excluding SKNet, which provides somatosensory attention, also led to a moderate drop in performance (80% success rate). This suggests that tactile-based feature selection is important for adapting force interactions during phases such as passing the heel.
In contrast, removing the hierarchical LSTM caused a dramatic failure (5% success rate), highlighting the critical role of temporal modeling for maintaining consistent motion over long dressing sequences. When both semantic segmentation and depth estimation were removed, the task failed entirely (0% success), confirming that semantic-based visual attention is essential for robust perception beyond raw RGB information.
Overall, the ablation results indicate that stable sock dressing requires the combined use of semantic perception, depth-aware attention, tactile sensing, and temporal prediction.
Human Subject Experiments: Individual Differences and Environment Robustness
※Success: Smooth transition from toes to ankle via the heel.
Failure: Catching at the toes or failure to reach the ankle.
To evaluate generalization, the model trained solely on mannequin data was tested on 10 human participants with foot sizes ranging from 23.0 cm to 26.5 cm, under both seen (trained) and unseen (untrained) background conditions.
The proposed method achieved a success rate of 84% (42/50) in known environments and 74% (37/50) in unseen environments. In contrast, Action Chunking with Transformer (ACT) dropped from 66% in known backgrounds to 0% in unseen backgrounds, while Diffusion Policy (DP) failed to complete the task in all conditions.
Further analysis showed that the proposed method maintained stable tactile force profiles across different foot sizes, whereas ACT exhibited sharp force spikes, especially around the heel, indicating snagging or misalignment. Visual attention points consistently tracked the sock and foot regions even under background changes, demonstrating robustness to visual variation.
Statistical analysis confirmed a significant performance difference between the proposed method and ACT (p < 0.01), supporting the effectiveness of semantic-based visual attention and multimodal integration for handling individual differences.
Limitations:
Despite robust performance, several limitations remain.
・The toe-insertion phase was not rigorously evaluated and remains less reliable than later stages.
・Unexpected events such as socks catching on toenails occasionally required manual intervention, indicating the need for online motion replanning.
・The participant pool was limited in size and did not include subjects with severe involuntary movements.
Motion generation accounting for individual differences and unseen environments, compared with a baseline model
ACT : Seen Background
ACT : Unseen Background
Ours : Seen Background
Ours : Unseen Background
This work was supported by JST Moonshot R&D, Grant No. JPMJMS2031.
BibTeX:
@ARTICLE{11395615,
author={Tsukakoshi, Takuma and Miyake, Tamon and Ogata, Tetsuya and Wang, Yushi and Akaishi, Takumi and Sugano, Shigeki},
journal={IEEE Robotics and Automation Letters},
title={Close-Fitting Dressing Assistance Based on State Estimation of Feet and Garments With Semantic-Based Visual Attention},
year={2026},
volume={11},
number={4},
pages={3923-3930},
keywords={Clothing;Foot;Force;Robots;Semantics;Adaptation models;Shape;Feature extraction;Skin;Friction;Assistive robots;computer vision;human-robot interaction;imitation learning;robot sensing systems},
doi={10.1109/LRA.2026.3664535}}