Additional Experimental Details:
Here, we provide additional experimental details, including our tuning procedures for offline RL and BC on all domains.
Atari domains:
BC:
1. Policy class: Categorical policy with |A| outputs
2. Policy architecture: IMPALA ResNet without layer normalization
3. Policy checkpoint selection: oracle early stopping based on smoothed evaluation performance (see the sketch below)
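For concreteness, below is a minimal Python/NumPy sketch of one plausible reading of this oracle early-stopping rule: evaluate each checkpoint, smooth the evaluation returns with a moving average, and keep the checkpoint near the best smoothed score. The smoothing window and the mapping back to a checkpoint index are illustrative assumptions, not exact values from our runs.

    import numpy as np

    def select_bc_checkpoint(eval_returns, window=5):
        """Smooth per-checkpoint (oracle) evaluation returns with a moving average
        and return the checkpoint nearest the best smoothed score.
        `eval_returns[i]` is the evaluation return of checkpoint i."""
        returns = np.asarray(eval_returns, dtype=np.float64)
        kernel = np.ones(window) / window
        smoothed = np.convolve(returns, kernel, mode="valid")
        # Map the best smoothed window back to (roughly) its center checkpoint.
        return int(np.argmax(smoothed)) + window // 2

    # Example: performance peaks mid-training and then degrades.
    print(select_bc_checkpoint([10, 30, 80, 120, 115, 90, 60, 40]))  # -> 4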
Naive CQL:
1. Q-function architecture: standard multi-headed DQN architecture for Atari (3 conv layers, 2 feed-forward layers)
2. alpha hyperparameter in CQL: 0.1
Tuned CQL:
1. Q-function architecture: IMPALA ResNet without layer normalization
2. alpha hyperparameter in CQL: 0.1
3. Underfitting correction: DR3 regularizer with coefficient 0.03, taken directly from prior work without further tuning (a sketch of this regularizer follows this list)
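To make the tuned-CQL objective concrete, here is a minimal PyTorch-style sketch of a discrete-action CQL critic loss with the DR3 dot-product feature regularizer added. The two-output Q-network interface (Q-values plus penultimate-layer features), the simple mean-squared TD error, and the use of state features for the DR3 term are illustrative assumptions rather than our exact training code.

    import torch
    import torch.nn.functional as F

    def cql_dr3_critic_loss(q_net, target_q_net, batch, alpha=0.1, dr3_coef=0.03, gamma=0.99):
        """Sketch of a discrete-action CQL critic loss with a DR3-style regularizer.
        Assumes q_net(obs) returns (q_values [B, |A|], features [B, d]), where the
        features are the penultimate-layer activations (an assumed interface)."""
        obs, actions, rewards, next_obs, dones = batch

        q_all, feats = q_net(obs)                                  # Q(s, .) and phi(s)
        q_sa = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)    # Q(s, a) for dataset actions

        with torch.no_grad():
            next_q_all, _ = target_q_net(next_obs)
            target = rewards + gamma * (1.0 - dones) * next_q_all.max(dim=1).values

        td_loss = F.mse_loss(q_sa, target)

        # CQL conservative term: push down the logsumexp over actions, push up dataset actions.
        cql_term = (torch.logsumexp(q_all, dim=1) - q_sa).mean()

        # DR3-style underfitting correction: penalize the dot product between features of
        # consecutive states appearing in the TD backup (coefficient 0.03, from prior work).
        _, next_feats = q_net(next_obs)
        dr3_term = (feats * next_feats).sum(dim=1).mean()

        return td_loss + alpha * cql_term + dr3_coef * dr3_term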
Manipulation Domains:
BC:
1. Policy class: Tanh-Gaussian policy
2. Policy architecture: IMPALA ResNet, or 3 conv layers and 2 feed-forward layers (the default from COG (Singh et al. 2020))
In our experiments, we found that the IMPALA ResNet architecture and the simple convolutional network perform similarly. We even tried overfitting-correction techniques such as dropout, but did not find them to make a difference in actual policy performance.
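For reference, a minimal PyTorch-style sketch of the Tanh-Gaussian policy class used for manipulation BC is shown below. The encoder is left abstract (it can be the IMPALA ResNet or the 3-conv-layer network from COG), and the hidden width, log-std clamping range, and ReLU trunk are placeholder assumptions.

    import torch
    import torch.nn as nn

    class TanhGaussianPolicy(nn.Module):
        """Gaussian policy whose samples are squashed by tanh so actions lie in [-1, 1]."""

        def __init__(self, encoder: nn.Module, feature_dim: int, action_dim: int, hidden_dim: int = 256):
            super().__init__()
            self.encoder = encoder                      # e.g. conv stack or IMPALA ResNet torso
            self.trunk = nn.Sequential(nn.Linear(feature_dim, hidden_dim), nn.ReLU())
            self.mean = nn.Linear(hidden_dim, action_dim)
            self.log_std = nn.Linear(hidden_dim, action_dim)

        def forward(self, obs):
            h = self.trunk(self.encoder(obs))
            mean, log_std = self.mean(h), self.log_std(h).clamp(-5.0, 2.0)
            dist = torch.distributions.Normal(mean, log_std.exp())
            pre_tanh = dist.rsample()                   # reparameterized sample
            action = torch.tanh(pre_tanh)
            # log-probability with the tanh change-of-variables correction
            log_prob = (dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)).sum(dim=-1)
            return action, log_prob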
Naive CQL:
1. Policy class: Tanh-Gaussian policy
2. Policy architecture: 3 conv layers and 2 feed-forward layers (the default from COG (Singh et al. 2020))
3. Q-function architecture: 3 conv layers and 2 feed-forward layers (the default from COG (Singh et al. 2020))
4. Q-function overfitting correction: None
5. Policy checkpoint selection: the final checkpoint, at 1M gradient steps
6. alpha hyperparameter in CQL: 1.0 (based on Singh et al. 2020)
Tuned CQL:
1. Policy class: Tanh-Gaussian policy
2. Policy architecture: 3 conv layers and 2 feed-forward layers (the default from COG (Singh et al. 2020))
3. Q-function architecture: 3 conv layers and 2 feed-forward layers (the default from COG (Singh et al. 2020))
4. Q-function overfitting correction: None
5. Policy checkpoint selection: at the peak in Q-values (200k gradient steps); see the sketch at the end of this section
6. alpha hyperparameter in CQL: 1.0
Due to limited compute, we did not perform any offline tuning of the alpha value for the submission. However, since the Q-values eventually show a decreasing trend, a better alpha could likely be found that alleviates this drop in Q-values and, as a result of less overfitting, performs better.
Q-function and policy learning rates for both naive CQL and tuned CQL were set to the defaults from COG (Singh et al. 2020): Q-function = 3e-4, policy = 1e-4; these were not tuned.
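As referenced in the tuned CQL list above, one plausible way to operationalize the "peak in Q-values" checkpoint-selection rule is sketched below: track the average Q-value on dataset actions over training and keep the last checkpoint before it falls noticeably from its running maximum. The 5% drop threshold and the use of average dataset Q-values as the logged statistic are illustrative assumptions.

    import numpy as np

    def checkpoint_before_q_drop(avg_q_values, drop_frac=0.05):
        """Return the index of the last checkpoint before the (average dataset)
        Q-values fall by more than `drop_frac` from their running maximum.
        `avg_q_values[i]` is the average Q-value logged at checkpoint i."""
        q = np.asarray(avg_q_values, dtype=np.float64)
        running_max = np.maximum.accumulate(q)
        dropped = q < (1.0 - drop_frac) * running_max     # checkpoints past the peak
        return int(np.argmax(dropped)) - 1 if dropped.any() else len(q) - 1

    # Example: Q-values rise up to checkpoint 20, then slowly decay afterwards.
    qs = np.concatenate([np.linspace(1.0, 50.0, 21), np.linspace(49.0, 10.0, 30)])
    print(checkpoint_before_q_drop(qs))  # -> 22, the last checkpoint before a >5% drop from the peak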
Adroit domains:
BC:
1. Policy class: Gaussian policy with a learned per-state standard deviation
2. Hidden layer sizes: (256, 256, 256, 256)
3. Dropout probability (per-layer): 0.2
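Below is a minimal PyTorch-style sketch of this BC policy: a (256, 256, 256, 256) MLP Gaussian policy with a learned per-state standard deviation and dropout of 0.2 after each hidden layer. The ReLU activation, the log-std clamping range, and applying dropout after every hidden layer are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class GaussianBCPolicy(nn.Module):
        """MLP policy outputting a state-conditioned Gaussian, with per-layer dropout."""

        def __init__(self, obs_dim, action_dim, hidden=(256, 256, 256, 256), dropout=0.2):
            super().__init__()
            layers, in_dim = [], obs_dim
            for h in hidden:
                layers += [nn.Linear(in_dim, h), nn.ReLU(), nn.Dropout(dropout)]
                in_dim = h
            self.trunk = nn.Sequential(*layers)
            self.mean = nn.Linear(in_dim, action_dim)
            self.log_std = nn.Linear(in_dim, action_dim)   # learned per-state standard deviation

        def forward(self, obs):
            h = self.trunk(obs)
            return torch.distributions.Normal(self.mean(h), self.log_std(h).clamp(-5.0, 2.0).exp())

    def bc_loss(policy, obs, actions):
        # Maximum-likelihood behavioral cloning objective.
        return -policy(obs).log_prob(actions).sum(dim=-1).mean()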
Naive CQL:
1. Policy class: Gaussian policy with a learned per-state standard deviation (identical to BC)
2. Policy hidden layer sizes: (256, 256, 256, 256)
3. Dropout probability (per-layer) on the policy: 0.2
4. Q-function hidden layer sizes: (256, 256, 256)
5. alpha hyperparameter in CQL: 10.0
Tuned CQL:
1. Policy class: Gaussian policy with a learned per-state standard deviation (identical to BC)
2. Policy hidden layer sizes: (256, 256, 256, 256)
3. Dropout probability (per-layer) on the policy: 0.2
4. Q-function hidden layer sizes: (256, 256, 256, 256)
5. Q-function overfitting correction: Dropout probability of 0.4
6. alpha hyperparameter in CQL: 1.0 for hammer-human, 5.0 for pen-human, and 10.0 for relocate-human and door-human (as mentioned in the paper, we ran an offline search over multiple alpha values and picked the smallest alpha that did not lead to a divergent Q-function; a sketch of this selection rule is given at the end of this section)
7. Policy checkpoint selection: the latest checkpoint after the peak in Q-values (which appeared at 40k gradient steps for pen-human and hammer-human, 60k for door-human, and 20k for relocate-human)
Q-function and policy learning rates for both naive CQL and tuned CQL were set to the defaults from COG (Singh et al. 2020): Q-function = 3e-4, policy = 1e-4.
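The offline alpha search mentioned for tuned CQL above (pick the smallest alpha whose Q-function does not diverge) can be implemented roughly as sketched below. The divergence threshold and the assumption that each candidate run logs a curve of average dataset Q-values are illustrative choices.

    import numpy as np

    def pick_smallest_stable_alpha(q_curves_by_alpha, divergence_threshold=1e4):
        """`q_curves_by_alpha` maps each candidate alpha to the average dataset
        Q-values logged over training for a CQL run with that alpha. Returns the
        smallest alpha whose Q-values stay finite and below the threshold."""
        stable = []
        for alpha, curve in q_curves_by_alpha.items():
            curve = np.asarray(curve, dtype=np.float64)
            if np.all(np.isfinite(curve)) and np.max(np.abs(curve)) < divergence_threshold:
                stable.append(alpha)
        if not stable:
            raise ValueError("No candidate alpha produced a non-divergent Q-function.")
        return min(stable)

    # Example: alpha = 0.5 diverges, while 1.0 and 5.0 stay bounded -> pick 1.0.
    curves = {0.5: [10, 1e3, 1e6], 1.0: [10, 40, 55], 5.0: [5, 12, 14]}
    print(pick_smallest_stable_alpha(curves))  # -> 1.0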
AntMaze domains:
BC:
1. Policy class: Tanh-Gaussian policy
2. Policy hidden layer sizes: (256, 256, 256) [chosen from the two architectures we evaluated]
Naive CQL:
1. Policy class: Tanh-Gaussian policy
2. Policy hidden layer sizes: (256, 256, 256)
3. Q-function hidden layer sizes: (256, 256, 256)
4. Q-function overfitting correction: None
5. alpha hyperparameter: 0.1
6. Binary reward levels: [-0.1, 10.0] instead of [0, 1] (note that this conveys no more information than a [0, 1] reward and preserves the same optimal policy; a short relabeling sketch follows this list)
7. Policy checkpoint selection: at the end of training (500k gradient steps)
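For reference, the relabeling in item 6 is just the positive affine map r -> 10.1 * r - 0.1 applied to the original 0/1 reward; as noted above, it conveys no extra information and leaves the optimal policy unchanged. A one-line sketch:

    def relabel_antmaze_reward(r):
        """Map the original binary reward r in {0, 1} to {-0.1, 10.0}
        via the positive affine transformation r -> 10.1 * r - 0.1."""
        return 10.1 * float(r) - 0.1

    assert abs(relabel_antmaze_reward(0.0) - (-0.1)) < 1e-9
    assert abs(relabel_antmaze_reward(1.0) - 10.0) < 1e-9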
Tuned CQL:
1. Policy class: Tanh-Gaussian policy
2. Policy hidden layer sizes: (256, 256, 256)
3. Q-function hidden layer sizes: (256, 256, 256)
4. Q-function overfitting correction: None
5. Q-function underfitting correction: DR3, a dot-product regularizer on the last-but-one-layer features, with coefficient 0.03 taken directly from prior work (see the sketch in the Atari section above)
6. alpha hyperparameter: 0.1 for the medium maze, 1.0 for the big maze (found by searching over the range of values mentioned in the paper and picking the smallest one with non-divergent Q-values)
7. Binary reward levels: [-0.1, 10.0] instead of [0, 1] (note that this conveys no more information than a [0, 1] reward and preserves the same optimal policy)
8. Policy checkpoint selection: 500k gradient steps
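Putting the AntMaze tuned-CQL settings together, a compact configuration sketch is shown below; the key names are hypothetical and do not correspond to any particular codebase.

    # Illustrative AntMaze tuned-CQL configuration (key names are hypothetical).
    ANTMAZE_TUNED_CQL = {
        "policy_class": "tanh_gaussian",
        "policy_hidden_sizes": (256, 256, 256),
        "q_hidden_sizes": (256, 256, 256),
        "q_overfitting_correction": None,
        "dr3_coefficient": 0.03,          # dot-product regularizer on penultimate-layer features
        "cql_alpha": {"medium_maze": 0.1, "big_maze": 1.0},
        "binary_reward_levels": (-0.1, 10.0),
        "policy_checkpoint_gradient_steps": 500_000,
    }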