Although in the GROW task the best Transformer-based baseline achieves performance comparable to our method, its learning curve oscillates heavily and is unstable. In most tasks, the baselines did not perform well.
For each GNN-based and Transformer-based method, we implement a small version (with 16 segments) and a large version (with 64 segments). At every step, the robot is first segmented with unsupervised K-means clustering, and we use the state information of each segment's center-of-mass point to build the graph. The Transformer-based methods apply the same segmentation, but treat the segments as a sequence of tokens rather than graph nodes; a minimal sketch of this segmentation step is shown below. The overall curves and demos of the best-performing baselines follow the sketch.
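In this sketch, the array names (positions, velocities) and the use of scikit-learn's KMeans are illustrative assumptions, not the baselines' actual code.

```python
# Illustrative sketch: segment the robot's particles with K-means and build
# per-segment node features for the GNN / Transformer baselines.
import numpy as np
from sklearn.cluster import KMeans

def build_segment_features(positions, velocities, n_segments=16):
    """positions, velocities: (N, 2) arrays of per-particle states (assumed layout)."""
    # Cluster particles by position into n_segments groups (16 or 64).
    labels = KMeans(n_clusters=n_segments, n_init=10).fit_predict(positions)

    # Use each segment's center of mass and mean velocity as its node state.
    nodes = []
    for k in range(n_segments):
        mask = labels == k
        center = positions[mask].mean(axis=0)
        mean_vel = velocities[mask].mean(axis=0)
        nodes.append(np.concatenate([center, mean_vel]))
    # GNN baselines treat the rows as graph nodes; Transformer baselines treat
    # them as a token sequence.
    return np.stack(nodes)  # (n_segments, 4)
```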
RUN (X4 Speed)
OBSTACLE (X4 Speed)
GROW (X4 Speed)
SLOT (X4 Speed)
Figure 1.1 Illustration of the modular-based method.
Figure 1.2 Comparison with modular-based control algorithms.
Figure 2 The framework of CFP.
As described in Section 4.2 of the paper, the policy model is a fully convolutional network. The input of the model is a multi-channel state image (64x64x3) that encodes the robot's shape and its velocities in the x and y directions.
As described in Section 5.2 of the paper, the output of the model is an action image (8x8x2 or 16x16x2) representing the strength of the action field. After obtaining the action, we bicubically upsample it to an image of the same size and spatial alignment as the state image.
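For illustration, the following is a minimal sketch of such a fully convolutional policy mapping the 64x64x3 state image to an 8x8x2 action image. The layer configuration and the name ConvPolicy are assumptions for the sketch, not the exact architecture used in the paper.

```python
# Minimal sketch (illustrative, not the paper's exact architecture) of a fully
# convolutional policy: 64x64x3 state image in, 8x8x2 action image out.
import torch
import torch.nn as nn

class ConvPolicy(nn.Module):
    def __init__(self, in_channels=3, action_channels=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=4, stride=2, padding=1),  # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),           # 32 -> 16
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=4, stride=2, padding=1),           # 16 -> 8
            nn.ReLU(),
            nn.Conv2d(64, action_channels, kernel_size=3, padding=1),        # keep 8x8
        )

    def forward(self, state_image):          # state_image: (B, 3, 64, 64)
        return self.net(state_image)         # coarse action: (B, 2, 8, 8)
```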
The illustration is shown in Figure 2 (in the paper's Fig. 3, an animation substitutes for the real input/output). After the residual part is output, we follow the equation in Section 4.2 to compute the final output: the coarse action is first upsampled to 16x16x2, added to the residual action, and then upsampled to 64x64x2 to produce the fine control signal, as sketched below.
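A minimal sketch of this coarse-to-fine composition is shown below, assuming the coarse and residual actions are PyTorch tensors of shape (B, 2, 8, 8) and (B, 2, 16, 16); bicubic interpolation stands in for the upsampling operator.

```python
# Sketch of the coarse-to-fine composition described above (assumed tensor shapes).
import torch.nn.functional as F

def compose_action(coarse, residual):
    # Upsample the 8x8x2 coarse action to 16x16x2 and add the residual action.
    fine_16 = F.interpolate(coarse, size=(16, 16), mode="bicubic",
                            align_corners=False) + residual
    # Upsample the combined action to the 64x64x2 action field used by the simulator.
    return F.interpolate(fine_16, size=(64, 64), mode="bicubic", align_corners=False)
```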
In our project, we employ a grid with a resolution of 64x64x2 to store the action information. At every time step, the process involves two key steps:
1. Upsampling Coarse Action:
The coarse action, represented at a resolution of 8x8x2, is first upsampled to match the 64x64x2 grid. This upsampling prepares the action signal for distribution.
2. Distributing Signals to Particles:
Once upsampled, these action signals are distributed to the individual particles during MPM's grid-operation stage. This distribution is how the robot receives and reacts to action commands; a minimal sketch of both steps is given after this list.
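The sketch below illustrates both steps under simplifying assumptions: bicubic interpolation stands in for the upsampling operator, and a bilinear grid sample stands in for MPM's grid-to-particle transfer; tensor names and shapes are illustrative.

```python
# Minimal sketch of the two steps above (illustrative names, not the project's code):
# (1) upsample the 8x8x2 coarse action to the 64x64x2 grid, and
# (2) sample the grid at each particle's normalized position, standing in for
#     MPM's grid-to-particle transfer.
import torch
import torch.nn.functional as F

def apply_action_grid(coarse_action, particle_pos):
    """coarse_action: (1, 2, 8, 8); particle_pos: (N, 2) with coordinates in [0, 1]."""
    # Step 1: upsample the coarse action to the full 64x64 action grid.
    action_grid = F.interpolate(coarse_action, size=(64, 64),
                                mode="bicubic", align_corners=False)
    # Step 2: distribute the grid signal to particles via bilinear sampling.
    # grid_sample expects (x, y) coordinates in [-1, 1] with shape (B, H_out, W_out, 2).
    coords = (particle_pos * 2.0 - 1.0).view(1, 1, -1, 2)
    per_particle = F.grid_sample(action_grid, coords,
                                 mode="bilinear", align_corners=True)  # (1, 2, 1, N)
    return per_particle.squeeze(0).squeeze(1).t()  # (N, 2) action per particle
```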
The episode reward is the most important and most widely used metric for evaluating the performance of robot control algorithms.
In addition, we compare the running speed and memory usage of the proposed method with the baselines in the RUN task. The results indicate that our method is faster and more lightweight. We also report the success rate on the manipulation and reaching tasks whose success can be clearly defined (GROW and SLOT); the results indicate that our method outperforms the others.
Figure 3 Comparison with baselines on additional metrics.
In our method, the agent is trained with SAC, whose output is a sample from a Gaussian distribution: at every step, given the state input, the agent's policy outputs two vectors, Mu and Std (with dimension equal to the action dimension), and the final action vector is sampled from the resulting n-dimensional Gaussian distribution with mutually uncorrelated components. Thus, the proposed method should adapt well to noisy actuators.
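A minimal sketch of this sampling step is given below; the tensor names mu and std are illustrative, and the tanh squashing commonly used in SAC implementations is omitted for brevity.

```python
# Sketch of per-step Gaussian action sampling for a SAC-style policy.
import torch

def sample_action(mu, std):
    """mu, std: (action_dim,) tensors produced by the policy for one state."""
    # Each action dimension is an independent (uncorrelated) Gaussian, so the
    # joint sample factorizes over dimensions; rsample() keeps it differentiable.
    dist = torch.distributions.Normal(mu, std)
    action = dist.rsample()
    log_prob = dist.log_prob(action).sum()  # joint log-probability over dimensions
    return action, log_prob
```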
We conduct experiments to validate this point. Specifically, we apply a random zero-mask to either the observation or the action tensors to mimic real-world failure modes; the masking ratio is 20 percent (see the sketch below).
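A minimal sketch of the zero-masking is shown below; the function name and the use of PyTorch are illustrative assumptions.

```python
# Sketch of the failure simulation: zero out a random fraction of the entries
# of an observation or action tensor.
import torch

def random_zero_mask(x, drop_ratio=0.2):
    """Zero out a random `drop_ratio` fraction of the entries of x."""
    mask = (torch.rand_like(x) >= drop_ratio).float()
    return x * mask
```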
If we set the average performance (over 20 episodes in total) of the failure-free model to 1, the relative average performance of the observation-failure and action-failure models is shown in the bar plot. As we can see, due to the noise in the observation or action space, the average episode rewards are lower than in the failure-free case. However, in most tasks the model remains functional. Compared to the observation space, the action space is more robust to failures.
Figure 4 Robustness evaluation on the episode reward metric.