Our aim here is to learn a cost function of the form C_π(z_{t∈T}, z_goal), where the cost takes as input the states z_{t∈T} of the generated trajectory and the goal state z_goal. The cost C is parametrized by learnable parameters π, which are trained using bi-level optimization.
This simple cost function parametrization provides a constant learnable weight per (x, y) dimension of each of the K keypoints. This cost function thus has 2K parameters.
This cost extends the previous formulation by providing a learnable weight for each time step t. This adds flexibility and allows the cost to capture the time-dependent importance of specific keypoints. This cost function has 2TK parameters, which scale linearly with the horizon length.
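The two parametrizations above can be sketched as weighted squared distances between trajectory keypoints and the goal keypoints. The shapes and function names below are illustrative assumptions, not the paper's implementation; the only structural claim taken from the text is the parameter count (2K constant weights vs. 2TK time-dependent weights).

```python
import numpy as np

def constant_weight_cost(z, z_goal, w):
    """Constant-weight keypoint cost (hypothetical sketch).

    z:      (T, K, 2) trajectory of K 2-D keypoints over T time steps
    z_goal: (K, 2)    goal keypoint positions
    w:      (K, 2)    one learnable weight per (x, y) dimension of each
                      keypoint -> 2K parameters, broadcast over time
    """
    return np.sum(w * (z - z_goal) ** 2)

def time_varying_cost(z, z_goal, w_t):
    """Time-dependent keypoint cost (hypothetical sketch).

    w_t: (T, K, 2) -- a separate learnable weight for every time step,
         keypoint and dimension -> 2TK parameters
    """
    return np.sum(w_t * (z - z_goal) ** 2)
```

Broadcasting applies the (K, 2) weights of the first cost identically at every time step, while the second cost can up- or down-weight individual keypoints at specific points in the horizon.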
This cost, like the previous one, is also time-dependent, but it scales more easily to longer time horizons: it has 2JK parameters, with J < T. Kernels are uniformly spaced in time, and b is chosen to create some overlap between neighboring kernels.
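One way to realize this kernel-based parametrization is to mix J radial basis functions into per-time-step weights, so only 2JK weights are learned while the cost still varies over all T steps. The Gaussian kernel shape, the normalization, and all names below are assumptions for illustration; the text only specifies uniformly spaced kernels, an overlap parameter b, and the 2JK parameter count.

```python
import numpy as np

def kernel_weights(T, b, W):
    """Expand J kernel weights into per-time-step weights (sketch).

    W: (J, K, 2) learnable weights, one set per kernel -> 2JK parameters.
    Kernel centers are uniformly spaced over the horizon; b controls the
    kernel width and hence the overlap between neighboring kernels.
    """
    J = W.shape[0]
    centers = np.linspace(0, T - 1, J)                 # (J,)
    t = np.arange(T)[:, None]                          # (T, 1)
    phi = np.exp(-((t - centers) ** 2) / (2 * b**2))   # (T, J) Gaussian kernels
    phi /= phi.sum(axis=1, keepdims=True)              # normalize per time step
    return np.einsum('tj,jkd->tkd', phi, W)            # (T, K, 2)

def kernel_cost(z, z_goal, W, b):
    """Time-dependent keypoint cost with 2JK learnable parameters."""
    w_t = kernel_weights(z.shape[0], b, W)
    return np.sum(w_t * (z - z_goal) ** 2)
```

Because the normalized kernels sum to one at every time step, a constant W recovers the constant-weight cost, and J interpolates between the 2K and 2TK parametrizations.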
Following previous work [1, 2], we employ gradient-based bi-level optimization to learn the parameters π of the cost function. This involves the following interleaved optimization processes:
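The interleaved structure can be sketched as an inner loop that optimizes a trajectory against the learned cost C_π, and an outer loop that updates π so the resulting trajectory improves a task loss (here, matching a demonstration). This toy sketch treats the trajectory itself as the inner optimization variable and uses a finite-difference outer gradient for simplicity; the cited works instead differentiate through the inner optimization, and all names and shapes here are illustrative assumptions.

```python
import numpy as np

def inner_opt(z0, z_goal, pi, steps=50, lr=0.1):
    """Inner loop: minimize the learned cost sum(pi * (z - z_goal)^2) over z."""
    z = z0.copy()
    for _ in range(steps):
        grad = 2 * pi * (z - z_goal)  # gradient of the learned cost w.r.t. z
        z -= lr * grad
    return z

def outer_step(z0, z_goal, z_demo, pi, outer_lr=0.01, eps=1e-4):
    """Outer loop: update pi so the inner solution matches a demonstration.

    Uses a finite-difference estimate of d(task loss)/d(pi) to keep the
    sketch self-contained (the papers use analytic gradients instead).
    """
    base = np.sum((inner_opt(z0, z_goal, pi) - z_demo) ** 2)
    grad = np.zeros_like(pi)
    for i in np.ndindex(pi.shape):
        p = pi.copy()
        p[i] += eps
        perturbed = np.sum((inner_opt(z0, z_goal, p) - z_demo) ** 2)
        grad[i] = (perturbed - base) / eps
    return pi - outer_lr * grad
```

Interleaving these two loops is what makes the optimization bi-level: the outer task loss is evaluated only at the trajectory produced by the inner cost minimization.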
[1] S. Bechtle, A. Molchanov, Y. Chebotar, E. Grefenstette, L. Righetti, G. Sukhatme, and F. Meier. Meta-learning via learned loss. arXiv preprint arXiv:1906.05374, 2019.
[2] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.