We propose ECO (Energy-Constrained Optimization), a constrained RL framework that separates energy-related metrics from the reward and reformulates them as explicit inequality constraints. We evaluate ECO against MPC, standard RL with reward shaping, and four state-of-the-art constrained RL methods (see below). Experiments, including sim-to-sim and sim-to-real transfers on the kid-sized humanoid robot BRUCE, demonstrate that ECO significantly reduces energy consumption compared to the baselines while maintaining robust walking performance. To the best of our knowledge, this is the first work to achieve energy-efficient humanoid walking using constrained RL on a real humanoid robot.
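Concretely, a formulation of this kind can be written as a constrained MDP; the notation below (reward return $J_r$, energy-cost return $J_c$, cost budget $d$, discount $\gamma$) is standard constrained-RL notation assumed here for illustration rather than taken verbatim from the method:
$$
\max_{\pi} \; J_r(\pi) = \mathbb{E}_{\pi}\!\Big[\sum_{t} \gamma^{t} r_t\Big]
\quad \text{s.t.} \quad
J_c(\pi) = \mathbb{E}_{\pi}\!\Big[\sum_{t} \gamma^{t} c_t^{\text{energy}}\Big] \le d,
$$
which is relaxed via the Lagrangian $\mathcal{L}(\pi, \lambda) = J_r(\pi) - \lambda\big(J_c(\pi) - d\big)$ with a multiplier $\lambda \ge 0$.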
To evaluate the robustness and generalizability of the learned policy, we transfer it to two additional simulators, MuJoCo and Gazebo, and benchmark ECO's sim-to-sim performance against the baseline methods. In Gazebo, we compare motor energy consumption while BRUCE walks at 0.1 m/s over a 10 s period: MPC consumes approximately 2.3 times as much energy as ECO, and PPO approximately 1.4 times as much.
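For reference, this comparison can be reproduced from logged joint torques and velocities; the sketch below assumes energy is measured as integrated absolute mechanical power (the excerpt does not specify whether electrical losses are included), and the array names in the usage comment are hypothetical.

```python
import numpy as np

def motor_energy(tau: np.ndarray, qdot: np.ndarray, dt: float) -> float:
    """Integrate total motor energy over a logged rollout.

    tau  : (T, J) joint torques     [N*m]
    qdot : (T, J) joint velocities  [rad/s]
    dt   : logging timestep         [s]

    Assumes energy is the time integral of absolute mechanical power
    summed over joints; electrical losses are ignored.
    """
    power = np.abs(tau * qdot).sum(axis=1)  # (T,) instantaneous power [W]
    return float(power.sum() * dt)          # energy [J]

# Hypothetical usage: two logged 10 s walks at 0.1 m/s, 1 kHz torque logging.
# E_eco = motor_energy(tau_eco, qdot_eco, dt=0.001)
# E_mpc = motor_energy(tau_mpc, qdot_mpc, dt=0.001)
# print(f"energy ratio MPC / ECO = {E_mpc / E_eco:.2f}")
```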
BRUCE successfully rejects external disturbances and traverses outdoor terrain.
The policy network takes velocity commands and proprioceptive observations as input and outputs desired joint positions at 100 Hz to a PD controller, which updates the torque commands at 1000 Hz. The reward critic is trained with privileged observations. The simulator provides the reward as well as the energy and symmetry costs, from which the reward and cost returns are computed. The policy is then updated with the Lagrangian method to balance rewards against costs, and the trained policy is deployed directly to the real world.
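As a minimal illustration of the Lagrangian update, the sketch below shows a dual-variable step and a lambda-weighted advantage; the learning rate, cost limit, and the normalization by $1 + \lambda$ are assumptions, and the surrounding policy-gradient step (e.g., PPO on this advantage) is omitted.

```python
import numpy as np

def update_multiplier(lmbda: float, cost_return: float, limit: float,
                      lr: float = 0.05) -> float:
    """Gradient ascent on the dual variable: lambda grows while the
    estimated cost return exceeds its limit, and shrinks (down to 0)
    once the constraint is satisfied."""
    return max(0.0, lmbda + lr * (cost_return - limit))

def lagrangian_advantage(adv_reward: np.ndarray, adv_cost: np.ndarray,
                         lmbda: float) -> np.ndarray:
    """Advantage fed to the policy update: reward advantage minus the
    lambda-weighted cost advantage, normalized by (1 + lambda) so its
    scale stays bounded as lambda grows (a common but assumed choice)."""
    return (adv_reward - lmbda * adv_cost) / (1.0 + lmbda)

# Hypothetical numbers: cost return 1.8 vs. limit 1.0 -> lambda increases.
lmbda = update_multiplier(lmbda=0.3, cost_return=1.8, limit=1.0)
print(lmbda)  # ~0.34
```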