The landscape of Deep Learning has experienced a major shift with the pervasive adoption of Transformer-based architectures, particularly in Natural Language Processing (NLP). Novel avenues for physical applications, such as solving Partial Differential Equations and Image Vision, have been explored. However, in challenging domains like robotics, where high non-linearity poses significant challenges, Transformer-based applications are scarce. While Transformers have been used to provide robots with knowledge about high-level tasks, few efforts have been made to learn dynamics or perform system identification.
This paper proposes a novel methodology to learn a meta dynamical model of a high-dimensional physical system, such as the Franka robotic arm, using a Transformer-based architecture without prior knowledge of the system's physical parameters. The objective is to predict quantities of interest (end-effector pose and joint positions) given the torque signals for each joint. This prediction can be useful as a component for Deep Model Predictive Control frameworks in robotics. The meta-model establishes the correlation between torques and positions and predicts the output for the complete trajectory. This work provides empirical evidence of the efficacy of the in-context learning paradigm, suggesting future improvements in learning the dynamics of robotic systems without explicit knowledge of physical parameters
To challenge the model's capability, datasets were generated in various compositions and topologies. The configurations of datasets is represented in the Figure below.
Each test case represents the dataset collected with 1000 robots each. The performance of the models is evaluated on both in-distribution (ID) and slightly out-of-distribution (OOD) scenarios. Although the overall performance is generally higher for ID cases, it is notable that in terms of R^2, both test_5 and test_6, characterized by a lower master frequency, show similar results. The smallest dataset, Dataset1, requires approximately an hour of training.
The context used for the prediction is 20% as shown in Figure below.
If you find our project useful, please cite us as