By Oct 5 at 11:59 PM EST submit your project proposal along with a list of your team members
By Tuesday Nov 9 at 11:59 PM EST submit a midterm progress report
By Dec 12 at 11:59 PM EST submit the final report for your project
All project reports will be submitted via Gradescope.
Late submissions for the class project will be penalized by a deduction of 1/3 of the points per day (except for the final report, which has no late days)
The final report has no late days: if you submit it late, you will receive zero credit for that component. It must be submitted on time so that we can submit final grades on time
Only one submission is needed per team for each project submission.
The goal of the class project is to make sure that you can apply deep reinforcement learning to new problems that you might encounter after the class. With this goal in mind, the project is structured as follows:
Pick an RL environment for your project. You can use any environment that you choose! You can reuse an existing environment or create a new one. Here are some example environments that you can use to train your RL agent:
Standard gym environments: Atari, MuJoCo, Toy Text, Classic Control, Box2D (can be installed with OpenAI Gym)
Third party environments: A long list of environments compatible with OpenAI Gym
Gym Retro - a platform for reinforcement learning research on games
You should not choose an environment that is "too simple" such as cartpole or inverted pendulum.
Your life will be much easier if the environment follows the Gym interface (you will see that most RL environments you find online use this interface); see details here: https://www.gymlibrary.ml/ (a minimal reset/step loop is sketched below)
Set up a reward function (you can reuse an existing reward function or create a new one to define a new task)
Show the performance of a random agent in this environment (i.e. an agent that takes random actions); a minimal evaluation sketch follows this list
Train 3 different RL methods in this environment
On-policy Policy-gradient RL
Off-policy Q-function-based RL
Model-based RL
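To make the steps above concrete, here is a minimal sketch of the Gym reset/step interface and of evaluating the random-agent baseline. The environment ID and episode count are placeholders, and the exact reset/step signatures vary slightly across Gym versions (newer releases return an extra info value from reset and separate terminated/truncated flags from step):

import gym
import numpy as np

def evaluate_random_agent(env_id="LunarLander-v2", num_episodes=20):
    # Roll out a uniformly random policy and record the return of each episode.
    env = gym.make(env_id)
    returns = []
    for _ in range(num_episodes):
        obs = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = env.action_space.sample()          # random action
            obs, reward, done, info = env.step(action)  # classic 4-tuple API
            episode_return += reward
        returns.append(episode_return)
    env.close()
    return np.mean(returns), np.std(returns)

if __name__ == "__main__":
    mean_return, std_return = evaluate_random_agent()
    print(f"Random agent return: {mean_return:.1f} +/- {std_return:.1f}")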
To clarify, you are allowed to use existing RL libraries for your project. The idea of the project is to help you get used to how you might use RL for an application after the class is complete. This type of learning is meant to complement the homeworks, in which you will implement (parts of) RL algorithms yourself. You are not required to use any existing RL libraries; if you prefer, you can implement the algorithms yourself from scratch or reuse code from the homeworks. If you choose to use RL libraries, here are a few that you might find helpful:
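For illustration only, if you happened to pick a library such as Stable-Baselines3 (just one option among many; verify the class names and arguments against its documentation), training an on-policy policy-gradient agent and an off-policy Q-function-based agent might look roughly like this:

import gym
from stable_baselines3 import PPO, DQN

env = gym.make("LunarLander-v2")        # placeholder environment ID (discrete actions)

# On-policy policy-gradient method (PPO).
ppo = PPO("MlpPolicy", env, verbose=1)
ppo.learn(total_timesteps=200_000)      # timestep budget is a placeholder
ppo.save("ppo_agent")

# Off-policy Q-function-based method (DQN) trained on the same environment.
dqn = DQN("MlpPolicy", env, verbose=1)
dqn.learn(total_timesteps=200_000)
dqn.save("dqn_agent")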
Next, you should propose 2 modifications. These modifications can be as big as proposing a new RL algorithm that you want to create, or they can be smaller changes. Here are some examples of modifications (though you should not feel constrained by this list - these are just examples):
Change the input domain or add another sensor (e.g. state, RGB images, depth images, point clouds, tactile sensing, force feedback, etc)
Change the network architecture
Change the action space (e.g. end-effector position control, end-effector force control, joint torques, joint angles, high-level action primitives, different gripper type, etc)
Change the reward function (e.g. add intermediate rewards to guide the agent to achieve the task); a reward-shaping sketch follows this list
Add an auxiliary loss function
Change the environment in some way (e.g. add more obstacles)
Compare different RL algorithms (for example, try a few different types of model-based RL algorithms and compare their performance)
Modify an existing RL algorithm in some way
Vary algorithm hyperparameters
Create a new RL algorithm
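As one concrete example from the list above, changing the reward function can be done with a small wrapper; the shaping term below is purely hypothetical and would need to be replaced with something meaningful for your task:

import gym

class ShapedRewardWrapper(gym.Wrapper):
    """Adds a hypothetical intermediate reward on top of the environment's own reward."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Hypothetical shaping bonus; e.g. a negative distance-to-goal computed
        # from obs or info for your particular environment.
        shaping_bonus = 0.0
        return obs, reward + shaping_bonus, done, info

env = ShapedRewardWrapper(gym.make("LunarLander-v2"))   # placeholder environment ID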
Feel free to discuss your project ideas with a TA or instructor.
You will be required to submit a writeup of your class project; see below for details.
We recommend that you start forming a team as soon as possible (or you can work by yourself). The project must be done in a group of 1-2 students. If you do not know anyone else in the class, turn to your neighbors before class and introduce yourself! You can also post on Piazza if you are looking for teammates.
Once you have a team, please register it on Canvas under the People -> Groups section.
Your grade for the class project will be subdivided as follows:
Project proposal: 20%
Midterm Report: 30%
Final Report: 50%
The instructions for each of these components are as follows:
Download the latest RSS LaTeX template and use it for your project writeups. Add your title, names of authors, affiliation and abstract according to RSS guidelines. Make sure that the authors are not listed as anonymous so that we can grade your submission!
Your proposal will be evaluated as follows (20 total points):
Report follows the RSS template (1 point)
Environment (2 points): Describe the environment that you will use for your experiments, in detail. Please include images of your environment. You should not choose an environment that is "too simple" such as cartpole or inverted pendulum.
Reward function (1 point): Describe your reward function. You can change this later on in the course if you like.
Method: Describe the 2 modifications that you are going to try. You can change these later on in the course if you like. Points are as follows:
Proposed modification #1 is clearly explained (4 points)
Proposed modification #2 is clearly explained (4 points)
Results:
Includes a plot of the performance of a random agent (i.e. an agent that takes random actions) (4 points)
Link to a website showing videos of the performance of the random agent (i.e. an agent that takes random actions) (4 points)
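One way to produce the required videos, assuming a reasonably recent version of Gym, is the RecordVideo wrapper (older versions used gym.wrappers.Monitor instead; check the documentation for your version):

import gym
from gym.wrappers import RecordVideo

# Depending on your Gym version, you may also need render_mode="rgb_array" in gym.make.
env = gym.make("LunarLander-v2")                   # placeholder environment ID
env = RecordVideo(env, video_folder="videos")      # writes video files under ./videos

obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())  # random agent
env.close()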
As before, please use the latest RSS template. You should have completed the experiments on "On-policy Policy-gradient RL" and "Off-policy Q-function-based RL", but you do not yet need the experiments on "Model-based RL" or your two modifications.
Midterm reports will be evaluated as follows (30 total points) - new sections are bolded in red. You are welcome to change any of the sections from the project proposal if you wish.
Report follows the RSS template (1 point)
Environment (1 point): Describe the environment that you will use for your experiments, in detail. You should not choose an environment that is "too simple" such as cartpole or inverted pendulum.
Reward function (1 point): Describe your reward function. You can change this later on in the course if you like.
Method: Describe the 2 modifications that you are going to try. You can change these later on in the course if you like. Points are as follows:
Proposed modification #1 is clearly explained; the modifications should be explained in more detail for the midterm than in the initial proposal. (3 points)
Proposed modification #2 is clearly explained; the modifications should be explained in more detail for the midterm than in the initial proposal. (3 points)
Results (note that the performance of the different methods should be on the same figure; a plotting sketch follows this rubric):
Includes a plot of the performance of a random agent (i.e. an agent that takes random actions) (3 points)
Includes a plot of the experiment for "On-policy Policy-gradient RL" (same figure as above) (3 points)
Experiment for "On-policy Policy-gradient RL" is working significantly better than the random agent (3 points). The agent does not need to perfectly solve the task or even perform particularly well; the main point is just to show that your policy is learning something, by demonstrating that the performance is better than random.
Includes a plot of the experiment for "Off-policy Q-function-based RL" (same figure as above) (3 points)
Experiment for "Off-policy Q-function-based RL" is working significantly better than the random agent (3 points). The agent does not need to perfectly solve the task or even perform particularly well; the main point is just to show that your policy is learning something, by demonstrating that the performance is better than random.
Analysis of current results: 3 points
Link to a website showing videos of the performance of the different agents (updated to include new agents) (3 points)
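Since all learning curves must appear on the same figure, a minimal matplotlib sketch might look like the following; the arrays are placeholders standing in for whatever returns you actually logged for each method:

import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: replace with the returns you logged during training/evaluation.
steps = np.arange(0, 100)
random_returns = np.full_like(steps, fill_value=-200, dtype=float)  # flat baseline
pg_returns = np.linspace(-200, 50, num=len(steps))                  # on-policy curve
q_returns = np.linspace(-200, 100, num=len(steps))                  # off-policy curve

plt.figure()
plt.plot(steps, random_returns, label="Random agent")
plt.plot(steps, pg_returns, label="On-policy policy-gradient")
plt.plot(steps, q_returns, label="Off-policy Q-function-based")
plt.xlabel("Training iterations")
plt.ylabel("Average return")
plt.legend()
plt.savefig("learning_curves.png")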
As before, please use the RSS template. You should have all experiments complete.
Final reports will be evaluated as follows (total 50 points) - new sections are bolded in red. You are welcome to change any of the sections from the midterm report if you wish.
Report follows the RSS template (1 point)
Environment (1 point): Describe the environment that you will use for your experiments, in detail. You should not choose an environment that is "too simple" such as cartpole or inverted pendulum.
Reward function (1 point): Describe your reward function.
Method:
Proposed modification #1 is clearly explained; the modifications should be explained in more detail for the final report than in the initial proposal. (2 points)
Proposed modification #2 is clearly explained; the modifications should be explained in more detail for the final report than in the initial proposal. (2 points)
Results (Note that the performance of the different methods should be on the same figure):
Includes a plot of the performance of a random agent (i.e. an agent that takes random actions) (2 points)
Includes a plot of the experiment for "On-policy Policy-gradient RL" (same figure as above) (2 points)
Experiment for "On-policy Policy-gradient RL" is working significantly better than the random agent (3 points)
Includes a plot of the experiment for "Off-policy Q-function-based RL" (same figure as above) (3 points)
Experiment for "Off-policy Q-function-based RL" is working significantly better than the random agent (3 points)
Includes a plot of the experiment for "Model-based RL" (same figure as above) (3 points)
Experiment for "Model-based RL" is working significantly better than the random agent (3 points). The agent does not need to perfectly solve the task or even perform particularly well; the main point is just to show that your policy is learning something, by demonstrating that the performance is better than random.
Includes a plot of the experiment for "Modification 1" (same figure as above) (3 points)
Experiment for "Modification 1" is working significantly better than the random agent (3 points). The agent does not need to perfectly solve the task or even perform particularly well; the main point is just to show that your policy is learning something, by demonstrating that the performance is better than random.
Includes a plot of the experiment for "Modification 2" (same figure as above) (3 points)
Experiment for "Modification 2" is working significantly better than the random agent (3 points). The agent does not need to perfectly solve the task or even perform particularly well; the main point is just to show that your policy is learning something, by demonstrating that the performance is better than random.
Analysis of results (updated to analyze all results): 3 points
Link to a website showing videos of the performance of the different agents (updated to include all new agents) (3 points)
Conclusion and future work:
Conclusion summarizes the main takeaways from the experiments: 3 points
Future work describes some interesting future directions: 3 points