Below is a subset of the projects from previous years for which I have obtained permission to post the project title and abstract online. You can use these for inspiration when choosing your own course project topic.
Please note, though, that the course project requirements have changed since these projects were completed.
Exploration with Expert Policy Advice
Ashwin Khadke, Arpit Agarwal, Anahita Mohseni-Kabir, Devin Schwab:
Exploration in reinforcement learning is a challenging problem. Random exploration is often highly inefficient and may fail completely in sparse-reward environments. In this work, we developed a novel method that incorporates expert advice for exploration in sparse-reward environments. In our formulation, the agent has access to a set of expert policies and learns to bias its exploration based on the experts' suggested actions. By incorporating expert suggestions, the agent is able to quickly learn a policy to reach rewarding states. Our method can mix and match experts' advice during an episode to reach goal states. Moreover, our formulation does not restrict the agent to any policy set, which allows us to aim for a globally optimal solution. In our experiments, we show that using expert advice indeed leads to faster exploration in challenging grid-world environments.
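A minimal sketch of one way such expert-biased exploration could look in code; the epsilon-greedy mixing scheme and all names below are illustrative assumptions, not the authors' formulation:

```python
import numpy as np

def select_action(q_values, expert_actions, epsilon=0.1, expert_bias=0.5, rng=None):
    """Pick an action, biasing random exploration toward expert suggestions.

    q_values:       Q-value estimates for the current state, one per action
    expert_actions: actions suggested by the available expert policies
    epsilon:        probability of exploring instead of exploiting
    expert_bias:    when exploring, probability of following a random expert
    """
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:                       # explore
        if expert_actions and rng.random() < expert_bias:
            return int(rng.choice(expert_actions))   # follow one expert's suggestion
        return int(rng.integers(len(q_values)))      # otherwise explore uniformly
    return int(np.argmax(q_values))                  # exploit current estimates

# Example: four actions, two experts suggesting actions 1 and 2.
print(select_action(np.array([0.1, 0.5, 0.2, 0.0]), expert_actions=[1, 2]))
```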
Know What You Know and Learn What You Don’t
Victoria Dean and Adam Villaflor:
Recent work in deep reinforcement learning has valued end-to-end learning, which is effective for tasks in simulation [1][2][3] and simple real-world tasks [4]. However, end-to-end learning can be impractical for complex tasks that are difficult to model, such as autonomous driving or unsupervised exploration. In this work, we make the case for simplifying the learning problem by using higher-level input and output spaces. This simplification removes part of the complexity, as the model does not have to learn low-level aspects such as perception or motor control. Additionally, by using higher-level actions, the agent can produce more meaningful trajectories throughout the learning process, improving the stability and efficiency of learning. These more meaningful trajectories are also more interpretable, as it is easier to determine the intent of the agent during initial training and at test time. We present results on simplifying the state and action spaces for two different tasks: exploration in the Super Mario Bros game and lane-changing in a car simulator based on the highD dataset [5]. In Super Mario Bros, we change the intrinsic reward modality from images to audio, which improves sample efficiency. In our car simulator environment, our agent intermittently outputs high-level lane-change actions instead of controlling steering and throttle directly, which leads to significantly better results throughout the training process.
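As a rough illustration of the high-level action idea, the sketch below wraps a driving environment so the agent chooses among lane-change actions while a hand-coded controller issues steering and throttle. The environment and controller interfaces (current_observation, target_lane, control, step) are made-up stand-ins, not the authors' simulator:

```python
# Illustrative wrapper exposing high-level lane-change actions over a
# low-level driving environment; all interfaces here are hypothetical.
LANE_ACTIONS = ["keep_lane", "change_left", "change_right"]

class HighLevelDrivingWrapper:
    def __init__(self, env, controller, steps_per_decision=20):
        self.env = env
        self.controller = controller                 # hand-coded low-level controller
        self.steps_per_decision = steps_per_decision

    def step(self, high_level_action):
        """Roll out low-level controls for one high-level decision."""
        obs = self.env.current_observation()
        target_lane = self.controller.target_lane(obs, LANE_ACTIONS[high_level_action])
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.steps_per_decision):
            steer, throttle = self.controller.control(obs, target_lane)  # low-level command
            obs, reward, done, info = self.env.step((steer, throttle))
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```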
Active Exploration for Real-to-Sim for Sim-to-Real
Jacky Liang and Shivam Vats:
Training robotic policies in simulation suffers from the sim-to-real gap, as simulated dynamics can differ from real-world dynamics. Past work has tackled this problem through domain randomization (DR) and online system identification (Sys-ID). The former is sensitive to the manually specified training distribution of dynamics parameters and can result in behaviors that are too conservative, while the latter requires learning policies that concurrently perform the task and generate useful trajectories for Sys-ID. In this work, we train an exploration policy that explicitly performs task-specific exploration actions to identify physics parameters. These parameters are then used in simulation by model-based trajectory optimization algorithms, which perform the task in the real world. We implement the proposed framework on a linear system LQR task and in simulation experiments with a Franka Panda robot arm on a 2D dragging task. Empirically, we show that (1) task performance depends on the accuracy of the physics parameters used for optimization and (2) there is a broad range of simulation parameters that can produce a task-satisfying trajectory. Further, our analysis of the objective function for the optimal exploration policy that minimizes task regret in the linear system LQR case aligns with these observations.
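The toy example below illustrates the real-to-sim-for-sim-to-real loop on a 1D sliding block: an exploration push identifies a friction parameter, which is then used to plan a goal-reaching push. The dynamics and numbers are invented for illustration and have no connection to the authors' experiments:

```python
import numpy as np

def simulate(push_velocity, friction, dt=0.01):
    """Distance a block slides when released at push_velocity with Coulomb friction."""
    v, x = push_velocity, 0.0
    while v > 0:
        v -= friction * 9.81 * dt
        x += max(v, 0.0) * dt
    return x

# 1. "Real world" rollout with unknown friction (exploration action: a test push).
true_friction = 0.3
observed_distance = simulate(2.0, true_friction)

# 2. System identification: pick the friction value whose simulated rollout
#    best matches the observed real-world outcome.
candidates = np.linspace(0.05, 1.0, 200)
errors = [abs(simulate(2.0, mu) - observed_distance) for mu in candidates]
identified = candidates[int(np.argmin(errors))]

# 3. Model-based "trajectory optimization": choose the push that lands the
#    block closest to a 1 m goal under the identified friction.
pushes = np.linspace(0.1, 5.0, 500)
best_push = pushes[int(np.argmin([abs(simulate(p, identified) - 1.0) for p in pushes]))]

# 4. Execute in the "real" environment and check the resulting error.
print("identified friction:", round(identified, 3))
print("distance reached:", round(simulate(best_push, true_friction), 3))
```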
Improved Robotic Exploration through Additional Information
Victoria Dean and Adam Villaflor:
Many tasks that humans would like robots to perform in household or factory environments require robots to be proficient at object manipulation. However, traditional robotic controllers struggle in these domains due to their high dimensionality, complex contact dynamics, and limited observability with traditional sensor modalities. Model-free deep reinforcement learning is a promising approach for addressing the high dimensionality and complex dynamics because it avoids explicitly modeling the contact dynamics. Additionally, incorporating tactile sensing can alleviate the observability issues, as the robot can better ascertain the state of a manipulated object. Ideally, we would like to use sparse reward functions in these domains, as it can be very tedious to hand-design dense reward functions that lead to effective and robust policies. Unfortunately, traditional deep reinforcement learning approaches struggle to learn with sparse rewards due to the significant exploration problem. This issue is exacerbated in manipulation tasks, where a series of precise controls is often needed to achieve the goal. Thus, in this work we explore incorporating demonstrations into RL in order to learn effective policies for sparse-reward manipulation tasks. We validate the importance of using tactile sensing on a box-closing task, and of incorporating demonstrations on a fridge-closing task.
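One common way to fold demonstrations into off-policy RL is to keep demonstration transitions in the replay buffer and reserve a fraction of every training batch for them. The sketch below shows that scheme under assumed names; the project's actual mechanism may differ:

```python
import random
from collections import deque

class DemoReplayBuffer:
    """Replay buffer that mixes agent experience with demonstration transitions.

    Demo transitions are kept permanently and a fixed fraction of every batch
    is drawn from them; the rest comes from the agent's own experience.
    """

    def __init__(self, demos, capacity=100_000, demo_fraction=0.25):
        self.demos = list(demos)                  # (s, a, r, s_next, done) tuples
        self.agent_data = deque(maxlen=capacity)  # most recent agent transitions
        self.demo_fraction = demo_fraction

    def add(self, transition):
        self.agent_data.append(transition)

    def sample(self, batch_size):
        n_demo = min(int(batch_size * self.demo_fraction), len(self.demos))
        n_agent = min(batch_size - n_demo, len(self.agent_data))
        batch = random.sample(self.demos, n_demo)
        batch += random.sample(list(self.agent_data), n_agent)
        return batch
```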
Agent Probing Interaction Policies
Siddharth Ghiya, Oluwafemi Azeez, and Brendan Miller:
Reinforcement learning in a multi-agent system is difficult because such systems are inherently non-stationary. In this setting, identifying the type of the opposing agent is crucial and can help us address the non-stationarity. We have investigated whether we can employ probing policies that help us better identify the type of the other agent in the environment. We make the simplifying assumption that the other agent follows a stationary policy from a fixed set, which our probing policy tries to classify. Our work extends the Environmental Probing Interaction Policy framework to handle multi-agent environments.
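A minimal sketch of the classification step, assuming the trajectory gathered by the probing policy is scored against each candidate policy by action log-likelihood; the scoring rule and all names are assumptions, not the project's classifier:

```python
import numpy as np

def classify_opponent(observed, candidate_policies):
    """Guess which fixed policy the other agent is using.

    observed:           list of (state, opponent_action) pairs gathered while
                        running a probing policy against the unknown agent
    candidate_policies: list of functions state -> action-probability vector
    """
    scores = []
    for policy in candidate_policies:
        # Log-likelihood of the observed opponent actions under this candidate.
        log_lik = sum(np.log(policy(s)[a] + 1e-8) for s, a in observed)
        scores.append(log_lik)
    return int(np.argmax(scores))

# Tiny usage example with two hand-coded candidate policies over 2 actions.
aggressive = lambda s: np.array([0.9, 0.1])
passive = lambda s: np.array([0.2, 0.8])
observed = [(None, 0), (None, 0), (None, 1)]
print(classify_opponent(observed, [aggressive, passive]))  # -> 0 (aggressive)
```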
Learning Off-Policy with Online Planning
Harshit Sikchi and Wenxuan Zhou:
We propose Learning Off-Policy with Online Planning (LOOP), combining techniques from model-based and model-free reinforcement learning algorithms. The agent learns a model of the environment and then uses trajectory optimization with the learned model to select actions. To sidestep the myopia of fixed-horizon trajectory optimization, a value function is attached to the end of the planning horizon. This value function is learned through off-policy reinforcement learning, using trajectory optimization as its behavior policy. Furthermore, we introduce "actor-guided" trajectory optimization to mitigate the actor-divergence issue in the proposed method. We benchmark our method on continuous control tasks and demonstrate that it offers a significant improvement over the underlying model-based and model-free algorithms.
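A minimal sketch of the core planning step, with random shooting standing in for the trajectory optimizer: candidate action sequences are scored under the learned model by summed reward plus a learned value at the horizon. The optimizer choice and all names are assumptions, not the authors' implementation:

```python
import numpy as np

def plan_action(state, model, reward_fn, value_fn, horizon=10, n_candidates=500,
                action_dim=2, rng=None):
    """Return the first action of the best sampled action sequence.

    Each candidate sequence is scored as the sum of model-predicted rewards
    over the horizon plus the learned value of the final predicted state,
    which is what keeps fixed-horizon planning from being myopic.
    """
    rng = rng or np.random.default_rng()
    candidates = rng.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    best_score, best_action = -np.inf, None
    for seq in candidates:
        s, score = state, 0.0
        for a in seq:
            score += reward_fn(s, a)
            s = model(s, a)             # learned dynamics model
        score += value_fn(s)            # terminal value at the planning horizon
        if score > best_score:
            best_score, best_action = score, seq[0]
    return best_action

# Tiny usage example with made-up linear dynamics and quadratic reward.
model = lambda s, a: s + 0.1 * a
reward_fn = lambda s, a: -float(np.sum(s**2) + 0.01 * np.sum(a**2))
value_fn = lambda s: -float(np.sum(s**2))
print(plan_action(np.array([1.0, -1.0]), model, reward_fn, value_fn))
```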