Improved Reinforcement Learning Application Skills
Applied reinforcement learning to learning rate scheduling for generative models.
Implemented reinforcement learning algorithms such as Proximal Policy Optimization (PPO).
Expansion of Reinforcement Learning Applications
Applied reinforcement learning to problem-solving outside of typical environments such as games or MuJoCo.
Confirmed the potential of applying reinforcement learning to GAN models.
Importance of Learning Rate Adjustment
In model training, the learning rate determines the size of parameter updates and is a critical factor that influences performance.
In GAN models in particular, adjusting the learning rates of the Generator and Discriminator is crucial, as it significantly affects the quality of the generated results.
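As a minimal sketch of this point, assuming a PyTorch setup with separate optimizers (the placeholder modules and values below are illustrative, not the project's actual settings):

import torch

# Hypothetical placeholder modules; only the separate optimizers matter here.
generator = torch.nn.Linear(100, 784)
discriminator = torch.nn.Linear(784, 1)

# Each network has its own optimizer, so its learning rate can be tuned
# (or scheduled) independently of the other network's.
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

# Changing a learning rate mid-training is a simple update to the param group.
for group in opt_g.param_groups:
    group["lr"] = 1e-4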
Limitations of Existing Learning Rate Adjustment Methods
Rule-based learning rate scheduling methods adjust the learning rate based on predefined rules but may not guarantee optimal learning rates in different training environments.
This is particularly challenging in GAN models, where ambiguous performance metrics make learning rate adjustment even more difficult.
Develop a learning rate scheduling model that dynamically adjusts the learning rate of GAN models through reinforcement learning, maximizing training performance and improving the quality of generated images.
Verify that reinforcement learning-based learning rate scheduling outperforms traditional rule-based methods in terms of training performance and convergence speed.
CIFAR-10 Dataset
A dataset consisting of images from 10 classes, used for training and evaluating the GAN model.
Composed of 60,000 images, with 6,000 images per class.
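A minimal loading sketch with torchvision, assuming images are scaled to [-1, 1] for GAN training (the normalization values and batch size are assumptions, not the project's settings):

import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

# CIFAR-10: 60,000 32x32 color images, 10 classes, 6,000 images per class.
transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # scale pixels to [-1, 1]
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)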
Unconditional GAN
A basic GAN structure that generates images without class conditions, used for initial experiments with basic learning rate scheduling techniques.
DCGAN (Deep Convolutional GAN)
A CNN-based GAN architecture suited to learning complex image patterns; learning rate scheduling is applied to improve its performance.
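A minimal DCGAN-style Generator sketch for 32x32 CIFAR-10 images, built from the TransposedConv and BatchNorm operations listed in the state description below; the layer sizes are illustrative, not the project's exact architecture:

import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, z_dim=100, base=128):
        super().__init__()
        # Upsample a latent vector to a 32x32 RGB image with
        # TransposedConv + BatchNorm + ReLU blocks (DCGAN guidelines).
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, base * 4, 4, 1, 0), nn.BatchNorm2d(base * 4), nn.ReLU(True),    # 4x4
            nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1), nn.BatchNorm2d(base * 2), nn.ReLU(True),  # 8x8
            nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.BatchNorm2d(base), nn.ReLU(True),          # 16x16
            nn.ConvTranspose2d(base, 3, 4, 2, 1), nn.Tanh(),                                           # 32x32
        )

    def forward(self, z):
        # z: (batch, z_dim, 1, 1) latent noise -> (batch, 3, 32, 32) image
        return self.net(z)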
WGAN (Wasserstein GAN)
A GAN model that uses the Wasserstein distance for stable training and improved image quality; reinforcement learning-based learning rate scheduling was applied to it to verify performance.
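A minimal sketch of the WGAN critic update, assuming the original weight-clipping variant; the critic, optimizer, and image batches are placeholders:

import torch

def wgan_critic_step(critic, opt_d, real_imgs, fake_imgs, clip=0.01):
    # The WGAN critic maximizes E[D(real)] - E[D(fake)], an estimate of the
    # Wasserstein distance; here we minimize its negative.
    opt_d.zero_grad()
    loss_d = -(critic(real_imgs).mean() - critic(fake_imgs.detach()).mean())
    loss_d.backward()
    opt_d.step()
    # The original WGAN keeps the critic roughly 1-Lipschitz by weight clipping.
    for p in critic.parameters():
        p.data.clamp_(-clip, clip)
    return loss_d.item()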
State
Used a GNN to learn the state of the GAN model (see the sketch below):
Generator: Composed of TransposedConv and BatchNormalization operations.
Discriminator: Composed of Convolution and BatchNormalization operations.
Features: Include the mean and variance of the weights and biases, the mean and variance of the weight and bias gradients, the learning rate, and the training loss.
Encoded the target networks' nodes using a GCN (Graph Convolutional Network).
Computed embeddings for the Generator and Discriminator through an attention mechanism.
The concatenation of Generator and Discriminator embeddings was used as the state for DCGAN.
Overview of State
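A minimal sketch of how such a state could be built, assuming per-operation statistics as node features, a simple dense-adjacency GCN, and attention pooling; all module names and dimensions here are illustrative, not the project's implementation:

import torch
import torch.nn as nn

def layer_node_features(layer, lr, loss):
    # One node per operation (TransposedConv / Conv / BatchNorm); assumes the
    # layer has both weight and bias tensors with populated gradients.
    w, b = layer.weight, layer.bias
    gw = w.grad if w.grad is not None else torch.zeros_like(w)
    gb = b.grad if b.grad is not None else torch.zeros_like(b)
    return torch.tensor([w.mean().item(), w.var().item(),
                         b.mean().item(), b.var().item(),
                         gw.mean().item(), gw.var().item(),
                         gb.mean().item(), gb.var().item(),
                         lr, loss])

class SimpleGCNLayer(nn.Module):
    # One GCN layer over a dense (normalized) adjacency matrix: H' = ReLU(A @ H @ W).
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        return torch.relu(adj @ self.lin(x))

class GraphStateEncoder(nn.Module):
    # Encodes one network (Generator or Discriminator) into a single embedding:
    # two GCN layers over the operation graph, then attention pooling over nodes.
    def __init__(self, feat_dim=10, hidden=32):
        super().__init__()
        self.gcn1 = SimpleGCNLayer(feat_dim, hidden)
        self.gcn2 = SimpleGCNLayer(hidden, hidden)
        self.attn = nn.Linear(hidden, 1)

    def forward(self, node_feats, adj):
        h = self.gcn2(self.gcn1(node_feats, adj), adj)   # (num_nodes, hidden)
        weights = torch.softmax(self.attn(h), dim=0)     # attention over nodes
        return (weights * h).sum(dim=0)                  # (hidden,) embedding

# The PPO state is the concatenation of the two network embeddings, e.g.:
# state = torch.cat([encoder_g(g_feats, g_adj), encoder_d(d_feats, d_adj)])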
Reward
Inception Score (IS): Measures the sharpness and diversity of generated images using Pre-trained Inception-v3.
Learned Perceptual Image Patch Similarity (LPIPS): Measures the perceptual similarity between images using features from a pre-trained image classification network.
Independent Fréchet Inception Distance (iFID): A variant of FID that measures image similarity using a pre-trained Inception-v3 network.
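A minimal sketch of turning such a metric into a per-step reward, assuming the reward is the change in the metric between decision steps; the exact reward shaping used in the project is not specified here:

class MetricReward:
    # Wraps any image-quality metric (IS, LPIPS, iFID) and returns the change
    # in that metric since the previous decision step as the reward.
    def __init__(self, metric_fn, higher_is_better=True):
        self.metric_fn = metric_fn                        # e.g. a function computing IS on samples
        self.sign = 1.0 if higher_is_better else -1.0     # iFID / LPIPS: lower is better
        self.prev = None

    def __call__(self, generated_images):
        value = self.metric_fn(generated_images)
        reward = 0.0 if self.prev is None else self.sign * (value - self.prev)
        self.prev = value
        return reward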
Action
Proximal Policy Optimization (PPO) Algorithm
Used PPO algorithm to dynamically adjust the learning rate, optimizing model performance by finding the optimal learning rate during training.
Actor & Critic: each implemented as a 2-layer MLP.
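A minimal sketch of the 2-layer MLP actor and critic, assuming the action is a scalar learning-rate value (or adjustment) sampled from a Gaussian policy; this parameterization is an assumption, not the project's exact design:

import torch
import torch.nn as nn

class Actor(nn.Module):
    # 2-layer MLP policy: maps the GAN state embedding to the mean of a
    # Gaussian over the (log-)learning-rate action.
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))
        self.log_std = nn.Parameter(torch.zeros(1))

    def forward(self, state):
        mean = self.net(state)
        return torch.distributions.Normal(mean, self.log_std.exp())

class Critic(nn.Module):
    # 2-layer MLP value function: maps the state to a scalar value estimate.
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)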
GAN-LR Scheduler
Set 5,000 iterations of DCGAN as one episode.
Set the decision step k and the number of episodes n to train the learning rate scheduler (see the sketch below).
PPO algorithm for LR scheduling
Overview of GAN-LR scheduler
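A minimal sketch of the episode structure, assuming the agent picks new learning rates every k iterations; all method names below are placeholders for the project's actual GAN training, state-extraction, and reward code:

def run_episode(agent, gan, k=10, max_iter=5000):
    # One episode = 5,000 DCGAN iterations; every k iterations the PPO agent
    # observes the state, chooses a learning-rate action, and receives a reward.
    state = gan.extract_state()                  # GCN + attention embedding (placeholder)
    for it in range(1, max_iter + 1):
        gan.train_one_iteration()                # one G/D update (placeholder)
        if it % k == 0:
            action = agent.select_action(state)  # e.g. new learning rate or multiplier
            gan.set_learning_rates(action)       # apply to the G and D optimizers
            next_state = gan.extract_state()
            reward = gan.compute_reward()        # IS / LPIPS / iFID based (placeholder)
            agent.store_transition(state, action, reward, next_state)
            state = next_state
    agent.update()                               # PPO update at the end of the episode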
Rule-based learning rate scheduling baselines (see the sketch after this list)
Constant LRS
Step decay LRS
Cosine annealing LRS
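A minimal sketch of these three baselines using torch.optim.lr_scheduler; the step size, decay factor, and T_max values are illustrative, not the project's settings:

import torch

def make_opt():
    # Placeholder network and optimizer; lr_0 = 0.0002 as in the experiments.
    return torch.optim.Adam(torch.nn.Linear(10, 10).parameters(), lr=2e-4)

# Constant LRS: keep lr_0 fixed for every iteration.
constant = torch.optim.lr_scheduler.LambdaLR(make_opt(), lr_lambda=lambda it: 1.0)

# Step decay LRS: multiply the learning rate by gamma every step_size iterations.
step_decay = torch.optim.lr_scheduler.StepLR(make_opt(), step_size=1000, gamma=0.5)

# Cosine annealing LRS: decay the learning rate along a cosine curve over T_max iterations.
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(make_opt(), T_max=5000)

# During training, scheduler.step() is called once per iteration after optimizer.step().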
Setting
Episodes: 30 / k = 10 / max iterations: 5,000 / lr_0 = 0.0002
Reward
Compare to baseline
State model
K step
Episodes: 30 / k = 100 / lr_0 = 0.0002
Proposed a reinforcement learning model for learning rate scheduling in GAN models.
Conducted experiments with various rewards and state models applicable to GANs.
Challenges in Image Generation Using the GAN-LR Scheduler:
The advantage is not bounded, which may cause large deviations from the current policy and violate PPO's assumptions.
The reinforcement learning hyperparameter settings may be insufficiently tuned for generative-model tasks.
Within an episode, reinforcement learning focuses on states observed before the generative model has been sufficiently trained.
When using the Inception Score as a reward, the model may overly focus on diversity in the generated results.
Lack of Hyperparameter Exploration:
Additional experiments are needed to explore hyperparameters by varying episode lengths and decision steps.
Further exploration of more appropriate models is required (e.g., different encoder models, reinforcement learning hyperparameter sets, etc.).