Po-Yen Wu, Cheng-Yu Kuo, Yuki Kadokawa, and Takamitsu Matsubara
When lifespan is not considered (Before), the learned policy does not account for structural variations across the tool, often applying stress to weaker regions and causing early failure. By integrating lifespan estimation (via FEA) into reinforcement learning, the agent receives a life reward that guides it toward structurally robust regions (After). The resulting lifespan-guided tool-use policy balances the task reward and the life reward to achieve both task success and an extended tool lifespan.
In inaccessible environments with uncertain task demands, robots often rely on general-purpose tools that lack predefined usage strategies. These tools are not tailored for particular operations, making their longevity highly sensitive to how they are used. This creates a fundamental challenge: how can a robot learn a tool-use policy that both completes the task and prolongs the tool’s lifespan? In this work, we address this challenge by introducing a reinforcement learning (RL) framework that incorporates tool lifespan as a factor during policy optimization. Our framework leverages Finite Element Analysis (FEA) and Miner’s Rule to estimate Remaining Useful Life (RUL) based on accumulated stress, and integrates the RUL into the RL reward to guide policy learning toward lifespan-guided behavior. To handle the fact that the RUL can only be estimated after task execution, we introduce an Adaptive Reward Normalization (ARN) mechanism that dynamically adjusts reward scaling based on estimated RULs, ensuring stable learning signals. We validate our method across simulated and real-world tool-use tasks, including Object-Moving and Door-Opening with multiple general-purpose tools. The learned policies consistently prolong tool lifespan (up to 8.01× in simulation) and transfer effectively to real-world settings, demonstrating the practical value of learning lifespan-guided tool-use strategies.
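As a rough, self-contained illustration of the RUL estimate, the Python sketch below applies Miner’s Rule to per-cycle stress amplitudes extracted from an FEA stress history (e.g., via rainflow counting). The Basquin coefficients `sigma_f` and `b` are illustrative placeholders for material-dependent fatigue parameters, not the exact values used in our experiments.

```python
import numpy as np

def estimate_rul(stress_amplitudes, sigma_f=900.0, b=-0.1):
    """Rough Miner's-rule RUL estimate from one episode of tool use.

    stress_amplitudes: per-cycle stress amplitudes (MPa) extracted from the
        FEA stress history (e.g., via rainflow counting).
    sigma_f, b: illustrative Basquin S-N coefficients (material-dependent).
    Returns the estimated number of identical episodes until failure.
    """
    amps = np.asarray(stress_amplitudes, dtype=float)
    amps = amps[amps > 0]  # ignore zero-amplitude cycles
    if amps.size == 0:
        return np.inf  # no damaging cycles in this episode
    # Basquin's relation: sigma_a = sigma_f * (2 N_f)^b  =>  N_f = 0.5 * (sigma_a / sigma_f)^(1 / b)
    cycles_to_failure = 0.5 * (amps / sigma_f) ** (1.0 / b)
    damage_per_episode = np.sum(1.0 / cycles_to_failure)  # Miner's rule: sum of n_i / N_i
    return 1.0 / damage_per_episode
```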
Overview of the proposed method integrating lifespan-guided reward into reinforcement learning. During each rollout, the agent interacts with the environment, collecting state, action, force, and task reward information. At the end of each episode, the stress history is calculated via finite element analysis (FEA) and processed with Miner’s rule to estimate the remaining useful life (RUL). The RUL value is stored in a history buffer, which is used by the adaptive reward normalization (ARN) mechanism to determine dynamic upper and lower bounds. These bounds are applied to normalize the life reward for subsequent episodes, ensuring stable and meaningful reward signals for policy learning.
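The ARN step can be sketched as follows, assuming a fixed-size buffer of past RUL estimates and percentile-based bounds; the buffer size, percentiles, and clipping range here are illustrative choices rather than the exact settings of our implementation.

```python
from collections import deque
import numpy as np

class AdaptiveRewardNormalizer:
    """Sketch of adaptive reward normalization (ARN) for episode-level life rewards."""

    def __init__(self, buffer_size=100, low_pct=5.0, high_pct=95.0):
        self.rul_history = deque(maxlen=buffer_size)  # recent RUL estimates
        self.low_pct = low_pct
        self.high_pct = high_pct

    def update(self, rul):
        """Store the RUL estimated at the end of an episode."""
        self.rul_history.append(float(rul))

    def life_reward(self, rul):
        """Normalize an RUL estimate into a bounded life reward in [0, 1]."""
        if len(self.rul_history) < 2:
            return 0.0  # not enough history to set meaningful bounds yet
        lower = np.percentile(self.rul_history, self.low_pct)
        upper = np.percentile(self.rul_history, self.high_pct)
        if upper - lower < 1e-8:
            return 0.0  # degenerate bounds; skip normalization
        # Longer estimated lifespan maps to a larger life reward.
        return float(np.clip((rul - lower) / (upper - lower), 0.0, 1.0))
```

In such a sketch, the normalized life reward would be combined with the task reward (for example, as a weighted sum at the end of each episode) to form the return used for policy updates.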
In the Object-Moving task, a UR5e robot with a tool pushes a cylindrical object toward a target location on a planar surface with obstacles.
In the Door-Opening task, the robot uses a tool to press down and rotate a door handle, then pull the door open to 30 degrees.
Object-Moving
Demonstration of the policies learned by our proposed method on four different tools, compared with the baseline methods, highlighting the strategy used by each policy, the corresponding stress variation, and the resulting tool RUL after execution.
Tool 1: Ours, Baseline, Ours w/o ARN, Torque
Tool 2: Ours, Baseline, Ours w/o ARN, Torque
Tool 3: Ours, Baseline, Ours w/o ARN, Torque
Tool 4: Ours, Baseline, Ours w/o ARN, Torque
Demonstration of the policies learned by our proposed method on two different tools, compared with the baseline method.
Tool 1: Ours, Baseline
Tool 2: Ours, Baseline
Demonstration of the policies learned in simulation by our proposed method, compared with the baseline method in terms of the number of trials until tool failure.
Object-Moving
  Tool 1 - Ours: 1609 trials before tool failure; Baseline: 879 trials
  Tool 2 - Ours: more than 900 trials before tool failure; Baseline: 244 trials
Door-Opening
  Tool 1 - Ours: more than 2100 trials before tool failure; Baseline: 600 trials
  Tool 2 - Ours: 847 trials before tool failure; Baseline: 471 trials