Hyperparameters and Reproducibility
Network Architectures:
Actor network: fully connected with hidden dimensions [800, 600, 32].
Critic network: follows Bellemare et al. to output a distribution, with 100 atoms.
Encoder network
We use the same architectures when running baselines, e.g., the BC baseline uses the same actor network as described above.
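For concreteness, the sketch below shows one possible PyTorch instantiation of the actor and critic described above. Only the hidden dimensions and the 100-atom output are taken from this section; the activations, the tanh action head, and the critic's value support (v_min, v_max) are illustrative assumptions, and the encoder is omitted since its architecture is not specified here.

```python
# Hypothetical PyTorch sketch of the actor and distributional critic.
# Only the hidden sizes [800, 600, 32] and the 100-atom output come from the
# text above; activations, output heads, and the value support are assumptions.
import torch
import torch.nn as nn


def mlp(in_dim, hidden_dims=(800, 600, 32), out_dim=None):
    """Fully connected trunk with the hidden dimensions listed above."""
    layers, d = [], in_dim
    for h in hidden_dims:
        layers += [nn.Linear(d, h), nn.ReLU()]
        d = h
    if out_dim is not None:
        layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)


class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = mlp(obs_dim, out_dim=act_dim)

    def forward(self, obs):
        # Actions are normalized to [-1, 1], so a tanh head is a natural choice.
        return torch.tanh(self.net(obs))


class DistributionalCritic(nn.Module):
    """Categorical return distribution over 100 atoms (Bellemare et al.)."""

    def __init__(self, obs_dim, act_dim, num_atoms=100, v_min=-100.0, v_max=100.0):
        super().__init__()
        self.net = mlp(obs_dim + act_dim, out_dim=num_atoms)
        self.register_buffer("atoms", torch.linspace(v_min, v_max, num_atoms))

    def forward(self, obs, act):
        logits = self.net(torch.cat([obs, act], dim=-1))
        probs = torch.softmax(logits, dim=-1)
        q_value = (probs * self.atoms).sum(dim=-1)  # expected return
        return probs, q_value
```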
Hyperparameters - ODA Pretraining
actor lr: 3e-4
critic lr: 3e-4
encoder lr: 3e-4
adam beta1: 0.88
adam beta2: 0.92
sample offline data batch size: 1024
sample demo data batch size: 64
latent dimension: 12
beta: 0.01
action noise: 0.05
discount factor: 0.99
The action space and observation space are normalized to [-1, 1].
We perform data collection and training asynchronously, and cap the learn/act ratio at 3.
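For reference, the pretraining settings above can be collected into a single configuration object; the sketch below is illustrative, and the field names are hypothetical rather than the names used in our code.

```python
# Illustrative container for the ODA pretraining hyperparameters listed above;
# the field names are hypothetical, only the values are taken from the text.
from dataclasses import dataclass


@dataclass
class ODAPretrainConfig:
    actor_lr: float = 3e-4
    critic_lr: float = 3e-4
    encoder_lr: float = 3e-4
    adam_betas: tuple = (0.88, 0.92)
    offline_batch_size: int = 1024    # batch size for sampled offline data
    demo_batch_size: int = 64         # batch size for sampled demo data
    latent_dim: int = 12
    beta: float = 0.01
    action_noise: float = 0.05
    discount: float = 0.99
    max_learn_act_ratio: float = 3.0  # cap on gradient steps per environment step


# Example: building the actor optimizer with the Adam betas above
# (assumes a PyTorch `actor` module has already been constructed).
# cfg = ODAPretrainConfig()
# actor_opt = torch.optim.Adam(actor.parameters(), lr=cfg.actor_lr, betas=cfg.adam_betas)
```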
Hyperparameters - ODA Finetuning
actor lr: 3e-5
critic lr: 3e-5
encoder lr: 3e-5
We use learning-rate warm-up for the first 10k gradient steps.
The remaining parameters are the same as in pretraining.
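One way to realize this warm-up is a linear ramp on the learning rate via PyTorch's LambdaLR; the linear shape is an assumption, as the exact schedule is not specified above.

```python
# Minimal sketch of learning-rate warm-up over the first 10k gradient steps.
# A linear ramp is assumed; the warm-up shape is not specified in the text.
import torch


def make_warmup_scheduler(optimizer, warmup_steps=10_000):
    def lr_lambda(step):
        # Scale the base finetuning lr (3e-5) from ~0 up to 1.0, then hold constant.
        return min(1.0, (step + 1) / warmup_steps)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)


# Usage with a hypothetical `actor` module:
# opt = torch.optim.Adam(actor.parameters(), lr=3e-5, betas=(0.88, 0.92))
# sched = make_warmup_scheduler(opt)
# ... after each gradient step: opt.step(); sched.step()
```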
Hyperparameters - AWAC Pretraining/Finetuning
Same as ODA Pretraining/Finetuning, but without the encoder-related parameters.
Hyperparameters - BC
Same as ODA Pretraining, but without the encoder- and critic-related parameters.
Stopping Criteria
We use a single training task, "E-model-10p", to select the training length for each method, and apply that length in all evaluations on the test tasks.
For finetuning, we stop training once 50 consecutive successes are reached, and then evaluate the last checkpoint over 100 trials.
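As a minimal sketch, this stopping rule can be expressed as a counter over episode outcomes; `run_episode` below is a hypothetical callback standing in for one episode of interaction (and, during training, the associated gradient updates).

```python
# Sketch of the finetuning stopping rule: stop once 50 consecutive successes
# are reached, then evaluate the last checkpoint over 100 trials.
# `run_episode(train)` is a hypothetical callback returning True on success.
from typing import Callable


def finetune_until_stopping(run_episode: Callable[[bool], bool],
                            target_consecutive: int = 50,
                            eval_trials: int = 100) -> float:
    consecutive = 0
    while consecutive < target_consecutive:
        consecutive = consecutive + 1 if run_episode(True) else 0
    # Evaluate the last checkpoint and report the success rate.
    successes = sum(run_episode(False) for _ in range(eval_trials))
    return successes / eval_trials
```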