Force Constrained Visual Policy: Safe Robot-Assisted Dressing via Multi-Modal Sensing
Zhanyi Sun∗ , Yufei Wang∗ , David Held† , Zackory Erickson†
Abstract
Robot-assisted dressing could profoundly enhance the quality of life of adults with physical disabilities. To achieve this, a robot can benefit from both visual and force sensing. The former enables the robot to ascertain human body pose and garment deformations, while the latter helps maintain safety and comfort during the dressing process. In this paper, we introduce a new technique that leverages both vision and force modalities for this assistive task. Our approach first trains a vision-based dressing policy using reinforcement learning in simulation with varying body sizes, poses, and types of garments. We then learn a force dynamics model for action planning to ensure safety. Due to the limitations of simulating accurate force data when deformable garments interact with the human body, we learn the force dynamics model directly from real-world data. Our proposed method combines the vision-based policy, trained in simulation, with the force dynamics model, learned in the real world, by solving a constrained optimization problem to infer actions that facilitate the dressing process without applying excessive force on the person. We evaluate our system in simulation and in a real-world human study with 10 participants across 240 dressing trials, showing that it greatly outperforms prior baselines.
Method
Fig 1. Method Overview
We propose a new method for the task of assistive dressing, named Force-Constrained Visual Policy (FCVP), shown in Fig 1. Our method handles the case in which only the visual modality (point clouds) can be simulated accurately enough to transfer to the real world, while the force modality cannot. Our key idea is to use a vision-based policy trained in simulation to propose actions, and then to use a force-based dynamics model trained in the real world to filter out unsafe actions. Our system consists of two parts. First, we leverage a vision-based policy from prior work, which is trained in simulation using reinforcement learning. By using simulation, we are able to train a single policy that generalizes to many variations of human arm poses, body shapes, and garments. Second, to ensure safe assistive dressing, we learn a force dynamics model that predicts the future forces applied to the human. The force dynamics model is trained directly on real-world data, because many simulators do not provide sufficiently accurate force simulation for deformables manipulated around the human body or other objects. At test time, the final robot action is inferred by solving a constrained optimization problem that combines the vision-based policy and the force dynamics model.
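The test-time action selection described above can be sketched as a sampling-based constrained optimization. This is a minimal illustration under stated assumptions, not the authors' implementation: the candidate sampling scheme (Gaussian perturbations of the policy action), the objective (staying close to the policy's proposed action), the fallback used when no candidate is predicted safe, and all names (`select_action`, `predict_force`, `noise_std`) are hypothetical.

```python
import math
import random


def select_action(policy_action, predict_force, force_threshold,
                  num_samples=64, noise_std=0.01, seed=None):
    """Return (action, feasible) for one timestep.

    policy_action: action proposed by the vision-based policy (list of floats).
    predict_force: hypothetical stand-in for the learned force dynamics model;
                   maps one candidate action to a predicted force.
    """
    rng = random.Random(seed)
    base = [float(x) for x in policy_action]
    # Candidates: the policy action itself plus Gaussian perturbations of it.
    candidates = [base] + [
        [x + rng.gauss(0.0, noise_std) for x in base]
        for _ in range(num_samples)
    ]
    forces = [float(predict_force(c)) for c in candidates]
    safe = [i for i, f in enumerate(forces) if f <= force_threshold]
    if safe:
        # Among safe candidates, stay as close as possible to the policy action.
        best = min(safe, key=lambda i: math.dist(candidates[i], base))
    else:
        # Assumed fallback: no safe candidate, so take the lowest predicted force.
        best = min(range(len(candidates)), key=lambda i: forces[i])
    return candidates[best], bool(safe)
```

When the policy action itself is predicted to be safe, this sketch returns it unchanged, so the force model only intervenes when the vision-based proposal is predicted to exceed the threshold.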
Real World Experiments
For our real-world human study, we recruit 10 participants and test two garments and three poses, as shown in the figure below. In addition to measuring the dressed arm ratio and the force applied to the participant's arm, we present participants with 7-point Likert items that range from 1=‘Strongly Disagree’ to 7=‘Strongly Agree’ for the following statements:
1. The robot successfully dressed the garment onto my arm.
2. The force the robot applied to me during dressing was appropriate.
3. The dressing process was comfortable for me.
We compare FCVP (ours) with the Vision Only baseline, which runs the vision-based policy without force information.
Fig 2. Real robot experiment setup
Quantitatively, our proposed method outperforms all baselines in terms of dressing performance, the force applied to the participant's arm, and the participants' responses to our survey questions. As shown in Tab 1, FCVP not only achieves a higher arm dressed ratio, but also has significantly lower force violations than the Vision Only baseline and the Vision w/ Random Action baseline.
Tab 1. Arm dressed ratio and average force violation of all methods
Fig 3. Density plot of force distribution (left), box plot of the force distributions (middle), and Likert item responses from participants (right)
The left and middle sub-figures are the density plot and box plot of the force distributions over all participants in the human study. In both sub-figures, the dashed black line represents the force threshold. FCVP (Ours) greatly reduces the force violation compared to the baselines. The rightmost sub-figure shows the Likert item responses from all 10 participants. FCVP achieves statistically significant differences from both baselines, with higher reported scores for all 3 Likert items (whether the dressing is successful, whether the force is appropriate, and whether the dressing is comfortable).
Real Robot Videos
The videos on the left side show the Vision Only baseline and the videos on the right side show our proposed method. The GIFs in the lower right corner show the running force of the dressing trials. The orange line is the actual force at each time step and the green line is the force threshold. The Vision Only baseline causes the garment to get caught on the person's arm and exerts a large force, while FCVP dresses the garment onto the person's arm smoothly and safely, with the force staying below the threshold.
*All videos are sped up by 4 times.
[Videos: 10 side-by-side pairs of dressing trials, each showing the Vision Only baseline (left) and FCVP (Ours) (right). Video overlay text: CoRL 2023 Anonymous Submission.]
Additional Experiment Results
More results on force distributions in sim2sim transfer experiments
The figure below provides a more detailed analysis of the force distributions of all compared methods. The left shows the box plot of the force distributions over all human meshes, arm poses, and garments, and the right shows the averaged forces along the dressing trajectories (as a function of the dressing timestep). As shown, FCVP achieves the smallest force violation amount among all methods. We note that FCVP minimizes the force violation amount rather than the force itself. Therefore, although the median force of FCVP is slightly larger than that of the Multimodal Safe RL and Force Only baselines, the variance of its force distribution is much smaller, and its overall violation amount is also smaller.
Fig 4. Box plot (left) and density plot (right) of force distributions on all human meshes, arm poses, and garments.
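The distinction between minimizing force and minimizing force violation can be illustrated with a small worked example. The numbers below are made up for illustration and are not measurements from our experiments, and the violation definition (the mean amount by which the force exceeds the threshold) is an assumption modeled on the max(0, f_t - \tau) penalty used for the baselines.

```python
from statistics import median


def mean_force_violation(forces, threshold):
    """Average amount by which the force exceeds the threshold (assumed definition)."""
    return sum(max(0.0, f - threshold) for f in forces) / len(forces)


threshold = 40.0  # illustrative threshold
low_variance = [30.0, 32.0, 34.0, 36.0, 38.0]   # higher median, never above threshold
high_variance = [5.0, 10.0, 20.0, 55.0, 70.0]   # lower median, two large spikes

print(median(low_variance), mean_force_violation(low_variance, threshold))
print(median(high_variance), mean_force_violation(high_variance, threshold))
```

The first trajectory has the higher median force (34 vs. 20) but zero violation, while the second has a lower median and an average violation of 9 units. This is the same pattern as a method with a slightly larger median force but a much narrower force distribution.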
Generalization of force dynamics model to different shape and size of human arms
The input to the force dynamics model includes the partial point cloud of the scene, which captures the shape and size of the human arm, so the force dynamics model should be able to generalize to the shape and size of the human arm within the training distribution.
To provide further analysis on this, we have plotted the training (collected from 11 participants) and evaluation (collected from another 10 participants) distributions of the participants’ forearm and upper arm lengths in our human study, as shown in Fig 5 and Fig 6 below.
Fig 5. Training and evaluation forearm length distribution
Fig 6. Training and evaluation upper arm length distribution
In Fig 5 and Fig 6, the shaded region shows the density plot of the forearm and upper arm length distributions. There is a noticeable distribution shift between the training and evaluation data, especially for the upper arm length.
In addition, we summarize the training and evaluation error of the force dynamics model in Tab 2 and Fig 7 below. Tab 2 shows the average training and evaluation error as L1 loss, in Newtons. Fig 7 plots the training and evaluation L1 error against forearm length and upper arm length, respectively.
Tab 2. L1 loss of training and evaluation error of the force dynamics model
Fig 7. Forearm length (left) / upper arm length (right) vs. force dynamics model prediction error
Due to the limited amount of training data and the distribution shift between the training and evaluation data (as shown in the plots above), we do observe a train-eval gap in the force dynamics model’s prediction error, measured by the L1 loss in the table and the figures. Despite this gap, the evaluation prediction error is still very small (less than 0.1 Newton), and the force dynamics model still proves useful in reducing force violations in our human study, as shown in Table I of the revised paper (Table II in the submitted version). These results indicate that the force dynamics model generalizes reasonably well across the shapes and sizes of human arms. Interestingly, we do not observe a strong correlation between the evaluation error and how much training data is available at a given arm length. We hypothesize that the generalization gap might be influenced by variables other than the arm lengths, such as the exact geometry of the human’s arm. We believe the gap can be narrowed in future work with regularization, early stopping, or more data. Another way to address the generalization issue could be to collect a small number of dressing trials from each new user and fine-tune the dynamics model on them, so that the model adapts to their specific arm shape and size. We leave this as interesting future work.
Generalization of the force dynamics model towards the clothing the users wear
The properties of the clothing the users wear, such as friction and elasticity, affect the forces applied to the users during the dressing process. For example, a long-sleeve T-shirt usually has higher friction than bare skin; therefore the same dragging motion from the robot might result in higher forces when dressing the user on top of a long-sleeve T-shirt than when dressing on top of bare skin. This affects the ground-truth force labels used to train the force dynamics model.
However, the force dynamics model takes as input a history of past forces, which is affected by the friction of the clothes the users are wearing. For example, a series of larger forces in the history might indicate that the dressing is happening on top of a high-friction surface such as a long-sleeve T-shirt, and thus the future predictions should have higher force magnitudes. This notion of using a history of sensor readings to implicitly enable the model to generalize to different environment properties has been explored in other works as well [1]. We also want to note that the generalization ability achieved in this way is likely to be limited, and we leave more robust generalization to these environmental factors as important future work, e.g., by leveraging adaptation methods such as those proposed in [1].
[1] Lee et al., Learning quadrupedal locomotion over challenging terrain, Science Robotics, 2020.
Optimality and inference time for solving the optimization problem in the real world
There can be cases where there is no action whose predicted force is below the threshold. We have computed quantitative metrics to analyze how often this happens during our human study, calculated as the ratio between the number of timesteps where there is no action whose predicted force is below the threshold, and the number of total timesteps. We found the ratio to be low: 4.028%, showing that such a case happens very infrequently during the human study. The ratio can be further lowered by sampling more actions when solving the constrained optimization problem. The average inference time when solving the optimization problem in the real-world experiments is 0.065 seconds per timestep.
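The infeasibility ratio above can be computed from logged predictions as follows. This is a minimal sketch: the log format (one list of predicted forces per timestep, one entry per sampled action) and the function name are assumptions for illustration.

```python
def infeasible_ratio(predicted_forces_per_step, force_threshold):
    """Fraction of timesteps at which every sampled action is predicted unsafe.

    predicted_forces_per_step: one list per timestep holding the predicted
    force of each sampled candidate action (hypothetical log format).
    """
    steps = list(predicted_forces_per_step)
    if not steps:
        raise ValueError("need at least one timestep")
    infeasible = sum(
        1 for forces in steps if all(f > force_threshold for f in forces)
    )
    return infeasible / len(steps)


# Four timesteps, two sampled actions each; timesteps 2 and 4 have no safe action.
print(infeasible_ratio([[1, 2], [5, 6], [0.5, 9], [7, 8]], force_threshold=3))
```

Sampling more candidates per timestep adds entries to each inner list, which can only keep or lower the ratio, consistent with the observation above.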
Generalization of force dynamics model to garments with different qualities
In our simulation experiments, we trained the force dynamics model on 5 garments, including a hospital gown and 4 different cardigans, with different densities and geometries. We found that the dynamics model performed well on all these 5 different garments. This empirically verifies that the force dynamics model can generalize to multiple (more than 2) garments that differ in qualities such as densities and geometries.
We have also added new experiments in simulation to test the ability of our model to generalize to variations in roughness and elasticity. In this experiment, there are 50 garments with the same geometry but different roughness and elasticity, obtained by randomly sampling the spring coefficients of these garments within the range [0.3, 1.5]. We then train a force dynamics model on 40 garments and test on the remaining 10. The results are shown in Tab 3 below. The force dynamics model generalizes reasonably well to garments with unseen roughness and elasticity: the average force violation increases slightly compared to that on the training garments, but is still much lower than that of the Vision Only baseline. We leave the study of more robust generalization to different garment qualities as future work.
Tab 3. Performance on training and evaluation elasticity & roughness
Force dynamics model input design decision
Intuitively, we treat the force prediction problem as Markovian, i.e., the future force depends only on the current state and the current action, which is why past actions are not included in the force dynamics model input. We have also added simulation experiments to verify this, where we train the force dynamics model with different lengths of past actions as part of the input. The results are summarized in Tab 4 below. As shown, performance is similar across different lengths of past actions, but we find the best performance when no past actions are provided to the model.
Tab 4. Performance with varying number of past actions as part of input to force dynamics model
We perform an ablation study to investigate how the number of past force measurements N affects the performance of FCVP. The results are shown in Tab 5 below. As shown, a larger number of past force measurements N leads to fewer force violations but a lower dressed ratio. With a longer history N, the force dynamics model can be trained to be more accurate, resulting in fewer force violations during planning. However, this accuracy comes at the cost of a smaller set of viable actions for our policy to choose from, leading to lower dressed ratios. We use N=5, which achieves a good trade-off between these two objectives.
Tab 5. Performance with varying number of past forces as part of input to force dynamics model
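As an illustration of the input design discussed above, a fixed-length force history with N=5 can be maintained with a bounded buffer. This is a sketch only: the class name, the zero padding at the start of a trial, and the oldest-to-newest ordering are assumptions, not details of our implementation.

```python
from collections import deque


class ForceHistory:
    """Keeps the N most recent force readings, oldest first."""

    def __init__(self, length=5, pad_value=0.0):
        # Assumed: pad with zeros until N real readings are available.
        self._buf = deque([pad_value] * length, maxlen=length)

    def push(self, force):
        # Appending to a full deque with maxlen drops the oldest reading.
        self._buf.append(float(force))

    def as_list(self):
        return list(self._buf)


history = ForceHistory(length=5)
for reading in [0.4, 0.9, 1.3]:
    history.push(reading)
print(history.as_list())  # two padded zeros followed by the three readings
```

Past actions are deliberately absent from this buffer, matching the Markovian treatment above: the model would receive the current observation, this force history, and the candidate action.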
Appendix Materials
Detailed reward function for training the vision-based policy in simulation
The reward function utilized in simulation to train the vision-based policy remains the same as that used in the prior work [1]. The full reward function we use is r=r_m + r_p + r_c + r_d. We expand each term as follows:
r_m measures the progression of the dressing task, which is the dressed distance along the simulated human's arm (see [1] for more details).
r_p is a penalty term that helps avoid unrealistic simulation in which the cloth penetrates through the arm when the simulated force is too large. Specifically, let f be the total force applied by the garment to the human in simulation, and f_{max} be the threshold above which the cloth penetrates the simulated human. This reward is computed as r_p=-0.001 \max(f-f_{max}, 0). We set f_{max} to 1000 units as measured by the simulator, a value empirically determined to be when the cloth starts penetrating the simulated human. We note that due to simulation modeling inaccuracies, this number given by the simulator does not correspond to 1000 Newtons of force in the real world.
The third term r_c is a contact penalty that prevents the robot end-effector from moving too close to the simulated human's arm. Let d_e be the shortest distance between the end-effector and the points of the arm. This reward term is computed as r_c = −0.01 * I(d_e < d_{\min}), where I(d_e < d_{\min}) = 1 if d_e < d_{\min} and 0 otherwise. d_{\min} is set to 1 cm in our experiments.
The last reward term r_d is a deviation penalty that discourages the sleeve from moving too far away from the simulated human's arm. Let d_g be the shortest distance from the center of the shoulder part of the garment to the points of the arm. This reward is computed as r_d = 0.02 if d_g < 3 cm, r_d = −0.05 if d_g > 7.5 cm, and r_d = 0 otherwise.
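The four terms above can be combined as in the following sketch. The function and argument names are hypothetical, and r_m is passed in directly because its computation (the dressed distance along the arm) is described in [1] rather than here.

```python
def dressing_reward(r_m, force, end_effector_dist_cm, garment_dist_cm,
                    f_max=1000.0, d_min_cm=1.0):
    """r = r_m + r_p + r_c + r_d, with constants taken from the text above.

    force is in simulator units (not Newtons); distances are in centimeters.
    """
    # Penetration penalty: only the force in excess of f_max is penalized.
    r_p = -0.001 * max(force - f_max, 0.0)
    # Contact penalty: indicator that the end-effector is closer than d_min.
    r_c = -0.01 if end_effector_dist_cm < d_min_cm else 0.0
    # Deviation term: bonus when the garment is close to the arm, penalty when far.
    if garment_dist_cm < 3.0:
        r_d = 0.02
    elif garment_dist_cm > 7.5:
        r_d = -0.05
    else:
        r_d = 0.0
    return r_m + r_p + r_c + r_d


# Force 200 units above f_max, end-effector too close, garment too far away:
# 0.5 - 0.2 - 0.01 - 0.05
print(dressing_reward(0.5, 1200.0, 0.5, 8.0))
```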
[1] Y. Wang, Z. Sun, Z. Erickson, and D. Held. One policy to dress them all: Learning to dress people with diverse poses and garments. In Robotics: Science and Systems (RSS), 2023
Detailed simulation parameters for the sim2sim transfer experiments
The values of parameters we used for simulation A and simulation B are shown in the table above. By using different simulation parameters, the simulated force readings between simulation A and simulation B are very different, approximating the sim2real gap.
More details on the compared baselines
Vision only baseline [1]: We obtain the code from the authors of [1]. We followed all hyper-parameter settings in the original paper.
Force only baseline [2]: Following Erickson et al. [2], actions are heuristically sampled within a forward task progression cone formed by the human's arm during data collection. The force dynamics model d_\psi(o, F, a) is trained in exactly the same way as in our method. During test time, we plan actions according to a cost function that encourages lower force applied to the human arm and higher dressing performance. Formally, the cost function is represented by three weighted terms as follows:
where d is the vector from the human's finger to the elbow when the garment is still on the forearm, and from the human's elbow to the shoulder when the garment is on the upper arm. We use w_1 = 0.001, w_2 = 1, and w_3 = 0.1.
Multimodal Policy baseline: For this baseline, we add the force history as part of the input to the actor and critic. The force history is concatenated with the point cloud latent features encoded by a PointNet++ encoder. We use a force history length of 5. An MLP takes the concatenated feature vector and produces the action. We train the multimodal policy to convergence in sim A using the task reward, and then transfer and fine-tune it in sim B. To encourage the policy to reduce force in sim B, we add a force penalty term to the reward function when fine-tuning. The overall reward is written as r_t' = r_t - w * \max (0, f_t-\tau), where r_t is the reward at time step t as described above in the detailed reward function, f_t is the force at time step t, \tau is the force threshold, and w is the weight given to the force penalty.
Force Residual Policy baseline: For this baseline, we first train a base policy in sim A using RL that takes only the visual observation as input. Then, in sim B, we train a separate force residual policy using RL. The force residual policy takes as input the point cloud, the force history, and the base action output by the base vision policy. The point cloud is encoded into a latent vector using PointNet++, and both the force history and the base action are concatenated with the point cloud latent vector. An MLP takes the concatenated feature vector and produces the residual action. The residual action is added to the action output by the base policy to form the final action. The reward for training the Force Residual Policy is the same as for the Multimodal Policy.
Multimodal Safe RL baseline: For this baseline, we use SAC-Lagrangian. We use the implementation from [3]. We set the constraint to be f_t = 40. Similar to Multimodal Policy baseline, we also add the force history as part of the input to the actor and critic for this baseline. We train the multimodal policy to convergence in sim A using SAC-Lagrangian, and then transfer and fine-tune it in sim B, also using SAC-Lagrangian.
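The fine-tuning reward shared by the Multimodal Policy and Force Residual Policy baselines, and the way the residual action is combined with the base action, can be sketched as follows. The function names are hypothetical, and the weight and threshold values in the usage lines are arbitrary placeholders, since this page does not state the value of w.

```python
def force_penalized_reward(task_reward, force, force_threshold, weight):
    """r_t' = r_t - w * max(0, f_t - tau): penalize only force above the threshold."""
    return task_reward - weight * max(0.0, force - force_threshold)


def compose_residual_action(base_action, residual_action):
    """Final action = base vision-policy action + force residual action."""
    if len(base_action) != len(residual_action):
        raise ValueError("actions must have the same dimension")
    return [b + r for b, r in zip(base_action, residual_action)]


# Placeholder values: force 10 units over the threshold with weight 0.01.
print(force_penalized_reward(1.0, force=50.0, force_threshold=40.0, weight=0.01))
print(compose_residual_action([0.1, 0.2, 0.0], [0.05, -0.2, 0.01]))
```

Because the penalty is zero whenever the force is at or below the threshold, this shaping leaves the task reward unchanged for safe timesteps.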
[1] Y. Wang, Z. Sun, Z. Erickson, and D. Held. One policy to dress them all: Learning to dress people with diverse poses and garments. In Robotics: Science and Systems (RSS), 2023
[2] Z. Erickson, H. M. Clever, G. Turk, C. K. Liu, and C. C. Kemp. Deep haptic model predictive control for robot-assisted dressing. In 2018 IEEE international conference on robotics and automation (ICRA), pages 4437–4444. IEEE, 2018
[3] Liu, Z., Cen, Z., Isenbaev, V., Liu, W., Wu, S., Li, B. and Zhao, D., 2022, June. Constrained variational policy optimization for safe reinforcement learning. In International Conference on Machine Learning (pp. 13644-13668). PMLR.
Human Study Experiment Details
Trial Procedure
We follow prior work [1] and conduct each trial as follows.
We first ask the participant to lift their right arm and hold the pose they are asked to imitate. We then record a point cloud of the participant's right arm. Next, we move the Sawyer's end-effector, which already holds the garment, to a position near the participant's hand. We capture the point cloud of only the garment using color thresholding. The participant holds their arm steady throughout the trial. The static arm point cloud, the garment point cloud, and the robot end-effector position are used as input to the vision-based policy. We run each trial for a fixed number of time steps unless the participant wishes to have the trial stopped or the perceived force is above a safety threshold (15 Newtons). After the trial terminates, we measure dressing distances to compute the whole/upper arm dressed ratio. At the end of each trial, we provide the participant with the 7-point Likert item statements.
Script to Participants
Before the study, we read the following script to the participants to explain to them the study procedure:
We are conducting a study to evaluate a robot-dressing system. The robot will dress the garment on your right arm.
We will now walk you through the steps we are taking.
You will first read and sign the consent form, and fill in a demographic form.
We will then measure some statistics of your arm, including the forearm length, upper arm length before the study starts.
We will put a marker on your shoulder for the experiment.
We will then start the study. There will be a total of 24 dressing trials.
The study will be divided into two parts. The first part has 8 trials and the second part has 16 trials. After the first part is complete, we will need 10 minutes to prepare for the second part. You can take a break at that time.
During each trial, you will be asked to hold a certain arm pose. We will test two garments and two poses during the study and we will alternate between those poses. We will show you the pose you need to hold on the screen in front of you and also demonstrate the pose by ourselves.
You may not move once the trial starts. In each trial, we will first capture the image of your right arm, and we will move the robot closer to your hand (you may not move during the process). Then we will start the dressing, which will last 40 seconds to 1 minute.
During dressing, the robot might pull you or tug you. In those cases, try your best not to move. If the force is too large or you feel uncomfortable, we will terminate the trial immediately.
Occasionally there will be operation failures on us for a trial. We will repeat those trials when such failures happen.
After each trial, please hold the arm static for another few seconds while we measure some statistics of the dressing performance. Then you can rest the arm and you will be asked to fill in three questions. The questionnaire is a likert-scale item with 3 questions. 1 is strongly disagree, 7 is strongly agree.
You can rest whenever you feel tired, just let us know.
Simulation arm pose regions
The above figures show the 4 arm pose regions we tested in simulation. The first has the elbow extending out, the second has the shoulder bending down, the third has the elbow bending down, and the last has the shoulder lowering down and the elbow bending inwards.