Le Mao, Andrew H. Liu, Renos Zabounidis, Yanan Niu, Zachary Kingston, and Joseph Campbell
Purdue University · Carnegie Mellon University · EPFL
Intelligent exploration remains a critical challenge in reinforcement learning (RL), especially in visual control tasks. Unlike low-dimensional state-based RL, visual RL must extract task-relevant structure from raw pixels, making exploration inefficient. We propose Concept-Driven Exploration (CDE), which leverages a pre-trained vision-language model (VLM) to generate object-centric visual concepts from textual task descriptions as weak, potentially noisy supervisory signals. Rather than directly conditioning on these noisy signals, CDE trains a policy to reconstruct the concepts via an auxiliary objective, learning general representations of the concepts and using reconstruction accuracy as an intrinsic reward to guide exploration toward task-relevant objects. Across five challenging simulated visual manipulation tasks, CDE achieves efficient, targeted exploration and remains robust to both synthetic errors and noisy VLM predictions. Finally, we demonstrate real-world transfer by deploying CDE on a Franka arm, attaining an 80% success rate in a real-world manipulation task.
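The abstract describes using the policy's reconstruction accuracy on VLM-generated, object-centric concepts as an intrinsic reward for exploration. The sketch below illustrates that idea only; the function names, the IoU-based accuracy metric, and the mixing coefficient `beta` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def intrinsic_reward(pred_mask: np.ndarray, vlm_mask: np.ndarray) -> float:
    """Score how well the policy's auxiliary head reconstructs the
    (possibly noisy) VLM-provided concept mask, via binary IoU.
    Assumed metric; the paper may use a different reconstruction loss."""
    pred = pred_mask > 0.5
    target = vlm_mask > 0.5
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 0.0
    inter = np.logical_and(pred, target).sum()
    return float(inter) / float(union)

def total_reward(task_reward: float,
                 pred_mask: np.ndarray,
                 vlm_mask: np.ndarray,
                 beta: float = 0.1) -> float:
    """Mix the extrinsic task reward with the intrinsic reconstruction
    bonus; beta is a hypothetical weighting coefficient."""
    return task_reward + beta * intrinsic_reward(pred_mask, vlm_mask)
```

Because the intrinsic bonus grows only when the agent attends to the task-relevant objects named in the concepts, exploration is biased toward those objects rather than uniformly over the scene.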
Synthetic Noise
Ablation Study
Exploration Heatmap
Open the microwave door
Turn on the light switch
Open the slide cabinet door
Turn on the upper left burner knob
Lift the block
We also conduct experiments with VLM-generated masks.
Zero-Shot Segmentation
Video