We present DEPS (Discovery of GEneralizable Parameterized Skills), an end-to-end algorithm for discovering parameterized skills from expert demonstrations. Our method learns parameterized skill policies jointly with a meta-policy that selects the appropriate discrete skill and continuous parameters at each timestep. Using a combination of temporal variational inference and information-theoretic regularization, we address the degeneracy common in latent variable models, ensuring that the learned skills are temporally extended, semantically meaningful, and adaptable. Our empirical results show that learning parameterized skills from multitask expert demonstrations significantly improves generalization to unseen tasks. Our method outperforms multitask as well as skill learning baselines on both the LIBERO and MetaWorld benchmarks. We also demonstrate that our approach discovers interpretable parameterized skills, such as an object grasping skill whose continuous arguments define the grasp location.
Figure 1: Three-level hierarchy of DEPS. The discrete skill policy selects a skill from the library given the full environment observation. Conditioned on that choice, the continuous‑parameter policy outputs the skill's continuous parameters. Finally, the low‑level action policy, which sees only a compressed one‑dimensional robot state, produces the primitive action.
Figure 2: Skills as Parameterized Trajectory Manifolds. We hypothesize that a single skill corresponds to a family of parameterized trajectories. A one-dimensional state representation indexes into this manifold to predict actions, promoting generalizability.
DEPS learns reusable parameterized skills from demonstration data with the aim of generalizing to new tasks with minimal additional training data. DEPS learns a three-level hierarchy (Figure 1) consisting of a discrete skill selector that chooses which skill to use (e.g. grasp, move, release), a continuous parameter selector that refines the chosen discrete skill to the specific use-case, and a low-level policy that chooses the action based on the discrete skill, continuous parameters, and environment state.
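The three-level hierarchy can be sketched as a single forward pass. This is a minimal illustration, not the paper's exact architecture: the module sizes, the use of simple linear layers, and the hard argmax over skills (the paper trains with temporal variational inference rather than a non-differentiable argmax) are all assumptions for clarity.

```python
# Illustrative sketch of the DEPS three-level hierarchy.
# Dimensions and layer choices are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalPolicy(nn.Module):
    def __init__(self, obs_dim=64, num_skills=8, param_dim=3, action_dim=4):
        super().__init__()
        self.num_skills = num_skills
        # Discrete skill selector: full observation -> logits over skills.
        self.skill_selector = nn.Linear(obs_dim, num_skills)
        # Continuous parameter selector: observation + chosen skill -> parameters.
        self.param_selector = nn.Linear(obs_dim + num_skills, param_dim)
        # Low-level policy: sees only the 1D compressed robot state,
        # plus the skill and its parameters -- not the full observation.
        self.low_level = nn.Linear(1 + num_skills + param_dim, action_dim)

    def forward(self, obs, compressed_state):
        skill_logits = self.skill_selector(obs)
        # Hard selection shown for illustration; training would use a
        # differentiable relaxation or variational posterior instead.
        skill = F.one_hot(skill_logits.argmax(-1), self.num_skills).float()
        params = self.param_selector(torch.cat([obs, skill], dim=-1))
        action = self.low_level(torch.cat([compressed_state, skill, params], dim=-1))
        return action, skill_logits, params
```

Note the information asymmetry built into the signatures: only the two high-level selectors receive `obs`, while the low-level policy consumes the one-dimensional `compressed_state`.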
Degenerate Solutions: Simply minimizing the behavior cloning loss using a three-level hierarchy and demonstration trajectories often leads to degenerate solutions that overfit to the provided demonstrations instead of discovering meaningful, generalizable abstractions. This is especially problematic when state spaces for different tasks have minimal overlap, allowing a high-capacity policy to memorize task-specific behaviors in different state space subsets.
(Re)Conceptualizing Skills: What makes a given policy correspond to a "skill"? Intuitively, a policy that completes different tasks in different subsets of the observation space (e.g. clean the dishes when in the kitchen, pick the toys when in the living room) does not correspond to a skill, while a policy to pick up a toy does correspond to a single skill. We posit that the reason the latter policy is a skill while the first is not is that it exhibits a lower-dimensional structure. Specifically, every trajectory obtained by executing the latter skill lies on the same low-dimensional manifold (i.e., picking up the toy from different initial positions involves the same sequence of action, just with a simple rotational transformation applied). This is not true for the first policy as trajectories that clean dishes are fundamentally different from those that pick toys.
With this view, we regard a single skill policy as a family of parameterized trajectories, a parameterized skill as simply an extension of the same family of parameterized trajectories, and different parameterized skills as involving fundamentally different types of trajectories. When learning a conventional policy over a high-dimensional space, a high-dimensional observation is required to "index" into the policy to retrieve the appropriate action. However, within a single trajectory, a one-dimensional variable is sufficient to index into the trajectory and retrieve the action (Figure 2).
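The trajectory-manifold view can be made concrete with a toy example. Below, a hypothetical "reach" skill is a family of trajectories indexed by a continuous parameter `theta` (the target), and a single scalar phase `s` suffices to retrieve the action at any point along a trajectory. These functions are illustrative stand-ins, not the learned DEPS policies.

```python
# Toy "trajectory manifold": one skill = one family of trajectories.
import numpy as np

def reach_trajectory(theta, s):
    """Straight-line reach toward target offset `theta`.
    The scalar s in [0, 1] is the 1D phase indexing the trajectory."""
    start = np.zeros(2)
    target = np.asarray(theta, dtype=float)
    return start + s * (target - start)  # position at phase s

def action_at(theta, s, ds=0.05):
    """Action = displacement that advances the phase by ds."""
    return reach_trajectory(theta, min(s + ds, 1.0)) - reach_trajectory(theta, s)

# Reaching different targets reuses the same trajectory family:
# only theta changes, while the indexing scheme (and hence the
# low-level policy) is shared across all instances of the skill.
```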
Avoiding Degenerate Solutions: To ensure DEPS learns generalizable parameterized skills, we make several key architectural choices:
Projective state compression to one dimension: While the high-level policies (discrete and continuous selectors) have access to rich visual observations, the low-level policy receives only a 1D compressed state (computed by projecting the robot's proprioceptive state onto a skill-specific axis). This information asymmetry forces the discrete skill and continuous parameters to encode skill-relevant information, and allows the learned parameterized skills to generalize across visually diverse environments.
Predicting continuous parameters per discrete skill: During pretraining, we only allow one set of continuous parameters for each discrete skill instance in a trajectory. This encourages temporally extended skills, preventing the continuous parameters from overfitting to specific timesteps.
Skill parameter norm penalty: We regularize the magnitude of continuous parameters to encourage compact, generalizable representations and prevent overfitting to specific tasks.
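Two of the choices above can be sketched directly in code: the projective 1D state compression and the skill-parameter norm penalty. The per-skill projection axes, the L2 form of the penalty, and the weight `lam` are assumptions for illustration; the paper's exact formulation may differ.

```python
# Hedged sketch of two DEPS regularizers; details are illustrative.
import torch

def compress_state(proprio, skill_axes, skill_idx):
    """Project the proprioceptive state onto a (learned) skill-specific
    axis, yielding a single scalar per timestep.
    proprio: (..., proprio_dim), skill_axes: (num_skills, proprio_dim)."""
    axis = skill_axes[skill_idx]
    axis = axis / (axis.norm() + 1e-8)          # unit-norm projection axis
    return (proprio * axis).sum(-1, keepdim=True)  # shape (..., 1)

def regularized_loss(bc_loss, params, lam=1e-3):
    """Behavior-cloning loss plus an L2 penalty on the magnitude of the
    continuous skill parameters, discouraging task-specific overfitting."""
    return bc_loss + lam * params.pow(2).sum(-1).mean()
```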
Using LIBERO and Metaworld-v2 as benchmarks, we provide a diverse and rigorous set of evaluations. Our quantitative results, which can be found in the paper, show that DEPS provides superior generalization to unseen tasks under (i) different environments and task splits, (ii) different pretraining budgets, (iii) different finetuning budgets, (iv) different amounts of pretraining data, and (v) different measures of success.
Additionally, we find that DEPS learns interpretable parameterized skills, with the discrete skills and continuous parameters encoding skill-relevant information. We provide three visualizations of this below:
Consistent application of the same discrete skills across tasks and environments.
Smooth variation in the policy of a single discrete skill on varying the continuous parameter.
Overlap in the continuous parameterizations used for a given discrete skill across different tasks.
We find that DEPS discovers semantically meaningful discrete skills corresponding to primitive behaviors such as grasping objects, opening doors, etc. Importantly, we find that the discovered skills are applied consistently across tasks. For example, a "grasp_object" skill used to pick a mug in a kitchen environment (top row, first frame) is also used to pick butter in a cabinet environment (top row, second frame). This suggests that the learned segmentation of trajectories into discrete skills is not overfit to specific tasks/environments, but rather represents generalizable subtask decompositions.
Below we show the segmentation of four different LIBERO tasks (columns) as achieved by two different DEPS runs (rows). More comprehensive data showing segmentations across all tasks in LIBERO can be found here.
LIBERO Task 33
LIBERO Task 3
LIBERO Task 55
LIBERO Task 40
We find that for a given discrete skill, slight modifications in the continuous parameter result in smooth variations in the resulting policy. This is shown in the figure below for three sample tasks. This observation suggests that the continuous parameters learned by DEPS are not overfit to specific tasks, but instead smoothly modulate the policy of a given discrete skill. More comprehensive data showing this trend holds across the tasks in the LIBERO dataset can be found here.
Original Skill
Value of the first parameter dimension increases by +0.25
Value of the second parameter dimension increases by +0.25
Value of the third parameter dimension increases by +0.25
We find that for a given discrete skill, there is significant overlap in the continuous parameters chosen when executing different tasks. This suggests that the continuous parameterizations learned by DEPS encode skill-specific rather than task-specific information, contributing to the method's strong generalization to unseen tasks.
In the visualizations below, we show the continuous parameterizations learned by DEPS for the three most commonly used discrete skills after pretraining on LIBERO. Each point represents the first two principal components of the continuous parameter used for the given discrete skill on a single trajectory in the pretraining dataset. Different shape and color combinations refer to different tasks in the dataset. We find that there is significant intermixing between the points corresponding to different tasks [note: one can zoom into a region by selecting it].
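The projection used in these visualizations can be reproduced with a few lines of linear algebra. This is a generic PCA-via-SVD sketch of the described analysis, with illustrative shapes; it is not the exact script used to produce the figures.

```python
# Project per-trajectory continuous parameters onto their first two
# principal components (one point per trajectory, colored by task).
import numpy as np

def pca_2d(params):
    """params: (n_trajectories, param_dim) array of continuous parameters
    for one discrete skill. Returns (n_trajectories, 2) PC coordinates."""
    centered = params - params.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

Intermixing of different tasks' points in this 2D projection is what indicates that the parameterization is shared across tasks rather than partitioned by task.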