Advanced Applied Deep Learning
Lecture Course
Sheng Yun Wu
Objective:
To teach students advanced optimization techniques used to improve the performance and convergence speed of convolutional neural networks (CNNs). Students will learn about various optimization algorithms, how learning rates influence model training, and how to apply learning rate scheduling techniques. By the end of the week, students should understand how to select and implement appropriate optimization methods for deep learning models.
Lecture 1: Overview of Optimization Algorithms
5.1 The Role of Optimizers in Deep Learning
What is an Optimizer?
Optimizers are algorithms used to minimize the loss function by adjusting the network’s weights during training.
The goal is to find weights that minimize the loss on the training set while still generalizing well, which is checked by monitoring the loss on a validation set.
Gradient Descent Recap:
The simplest optimization algorithm is Gradient Descent, which updates weights using the gradient of the loss function with respect to the parameters.
Standard gradient descent can be too slow or unstable, so variants of it are often used.
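As a minimal sketch of the update rule described above, the following applies plain gradient descent to a one-parameter toy loss. The function name gradient_descent, the toy loss (w - 3)^2, and the chosen learning rate are illustrative assumptions, not part of the course material.

```python
def gradient_descent(grad, w0, lr=0.1, steps=50):
    """Plain gradient descent: repeatedly apply w <- w - lr * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


def toy_grad(w):
    # Gradient of the assumed toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)


w_final = gradient_descent(toy_grad, w0=0.0)
print(w_final)  # should end up close to 3
```

In a real network the scalar w becomes the full set of weights and grad comes from backpropagation, but the update has the same form.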
5.2 Types of Gradient Descent:
Batch Gradient Descent:
Calculates the gradient for the entire dataset before updating weights.
Stable, but slow and memory-intensive per update, especially for large datasets.
Stochastic Gradient Descent (SGD):
Updates weights after computing the gradient for each data point.
Faster but can be noisier and lead to more erratic updates.
Mini-batch Gradient Descent:
A compromise between batch and stochastic gradient descent.
Uses a small batch of data points to compute gradients and update weights.
Faster than batch gradient descent and more stable than SGD.
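The three variants differ only in how many samples feed each weight update, which the sketch below expresses through a single batch_size argument. The linear-regression toy problem, the function names, and the hyperparameters are assumptions chosen to keep the example small; a CNN would replace grad_mse with backpropagation.

```python
import random


def grad_mse(w, b, batch):
    """Gradient of the mean squared error of y ~ w * x + b over one batch."""
    n = len(batch)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return gw, gb


def train(data, batch_size, lr=0.1, epochs=200, seed=0):
    """batch_size == len(data): batch GD; 1: SGD; anything in between: mini-batch."""
    rng = random.Random(seed)
    data = list(data)  # copy so shuffling does not mutate the caller's list
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            gw, gb = grad_mse(w, b, data[i:i + batch_size])
            w -= lr * gw
            b -= lr * gb
    return w, b


# Noise-free toy data drawn from y = 2x + 1 (an assumption for illustration).
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]

for name, bs in [("batch", len(data)), ("sgd", 1), ("mini-batch", 4)]:
    print(name, train(data, bs))
```

With 21 samples, batch gradient descent makes 1 update per epoch, SGD makes 21, and a mini-batch size of 4 makes 6. Because the toy data is noise-free, all three should recover roughly w = 2 and b = 1; on real data the SGD path would be visibly noisier.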
Lecture 2: Advanced Optimizers
5.3 Adam Optimizer
What is Adam?
Adam (Adaptive Moment Estimation) is one of the most widely used optimizers in deep learning.
Combines the benefits of both RMSprop and momentum.
How Adam Works:
Keeps track of exponentially decaying averages of past gradients (the first moment) and past squared gradients (the second moment).
Adjusts the learning rate for each parameter individually based on the historical gradients.
Advantages of Adam:
Works well with noisy data and sparse gradients.
Often works well with its default hyperparameters, so it typically needs less tuning than other optimizers.
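The description above can be sketched as a single-parameter update step. This follows the commonly published form of Adam with bias correction; the function name adam_step, the toy loss (w - 3)^2, and the learning rate of 0.01 used in the loop are illustrative assumptions.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w with gradient g at step t (t >= 1)."""
    m = beta1 * m + (1.0 - beta1) * g          # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g      # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter scaled step
    return w, m, v


# Minimize the assumed toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, g, m, v, t, lr=0.01)
print(w)  # should end up close to 3
```

Note that the very first step has size close to lr regardless of the gradient's magnitude, because the bias-corrected ratio m_hat / sqrt(v_hat) is then just the sign of the gradient.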
5.4 RMSprop (Root Mean Squared Propagation)
How RMSprop Works:
Similar to Adam, but only keeps track of the exponentially decaying average of squared gradients.
Scales the learning rate based on the recent magnitude of gradients.
When to Use RMSprop:
Often used in recurrent neural networks (RNNs) and other deep architectures.
Good for dealing with non-stationary objectives, where the scale of the gradients changes over the course of training.
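A minimal sketch of the RMSprop update, assuming the commonly used form with decay rate rho and a small eps for numerical safety. The function name rmsprop_step and the two-parameter toy loss are illustrative assumptions. The example shows that two parameters whose gradients differ by a factor of 100 still take steps of nearly the same size.

```python
import math


def rmsprop_step(w, g, s, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSprop update for a scalar parameter w with gradient g."""
    s = rho * s + (1.0 - rho) * g * g        # decaying average of squared gradients
    w = w - lr * g / (math.sqrt(s) + eps)    # divide by the recent gradient magnitude
    return w, s


# Assumed toy loss 50 * a^2 + 0.5 * b^2: the gradient is 100 * a for a and b for b.
a, s_a = 1.0, 0.0
b, s_b = 1.0, 0.0

# First step from the same start: the gradient scales differ, the step sizes do not.
a1, _ = rmsprop_step(a, 100.0 * a, s_a)
b1, _ = rmsprop_step(b, b, s_b)
print(a - a1, b - b1)

for _ in range(500):
    a, s_a = rmsprop_step(a, 100.0 * a, s_a)
    b, s_b = rmsprop_step(b, b, s_b)
print(a, b)  # both should end up near 0
```

With a constant learning rate, RMSprop may hover close to the minimum rather than settle on it exactly, which is one reason it is often paired with a decaying learning rate.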
5.5 Momentum-based Optimization
What is Momentum?
Momentum is used to accelerate gradient descent by considering the previous gradients to smooth out the updates.
Nesterov Accelerated Gradient (NAG):
A variant of momentum that computes the gradient at the future predicted point instead of the current position.
Often converges faster than classical momentum, because the look-ahead gradient corrects the update before an overshoot builds up.
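The difference between classical momentum and NAG comes down to where the gradient is evaluated, as the sketch below shows. The velocity formulation, the function names, and the toy loss (w - 3)^2 are assumptions for illustration; deep learning libraries may use an algebraically rearranged form.

```python
def momentum_step(w, v, grad, lr=0.01, mu=0.9):
    """Classical momentum: gradient evaluated at the current position w."""
    v = mu * v - lr * grad(w)
    return w + v, v


def nesterov_step(w, v, grad, lr=0.01, mu=0.9):
    """NAG: gradient evaluated at the look-ahead position w + mu * v."""
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v


def toy_grad(w):
    # Gradient of the assumed toy loss L(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)


w_m, v_m = 0.0, 0.0
w_n, v_n = 0.0, 0.0
for _ in range(300):
    w_m, v_m = momentum_step(w_m, v_m, toy_grad)
    w_n, v_n = nesterov_step(w_n, v_n, toy_grad)
print(w_m, w_n)  # both should end up close to 3
```

The velocity v carries a decaying sum of earlier gradients, which is what smooths the updates; mu controls how long that history persists.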
Lecture 3: Learning Rate Scheduling Techniques
5.6 Importance of Learning Rate in Model Training
What is a Learning Rate?
The learning rate determines how much the model weights are adjusted at each update step.
Challenges with Fixed Learning Rates:
If the learning rate is too high, the model may oscillate and not converge.
If the learning rate is too low, training may be slow and may stall on plateaus or in poor local minima.
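Both failure modes can be reproduced on a one-parameter example. The sketch below assumes the toy loss L(w) = w^2, for which each gradient descent step multiplies w by (1 - 2 * lr); the function name run_gd and the three learning rates are illustrative choices.

```python
def run_gd(lr, steps=50, w0=1.0):
    """Gradient descent on the assumed toy loss L(w) = w^2 (gradient 2 * w)."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w


# Too high: the factor (1 - 2 * lr) is below -1, so w flips sign and grows.
print("lr=1.1  ", run_gd(1.1))
# Too low: the factor is just below 1, so w has barely moved after 50 steps.
print("lr=0.001", run_gd(0.001))
# Reasonable: w shrinks quickly toward the minimum at 0.
print("lr=0.1  ", run_gd(0.1))
```

This convex toy loss has a single minimum, so it illustrates oscillation and slow progress but not the local-minimum problem, which only appears on non-convex loss surfaces such as those of neural networks.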
5.7 Learning Rate Schedulers
Learning Rate Decay:
Gradually decreases the learning rate over time as the model approaches convergence.
Helps prevent overshooting a minimum late in training and stabilizes the final updates.
Step Decay:
Reduces the learning rate by a factor (e.g., by half) after a predefined number of epochs.
Exponential Decay:
Decreases the learning rate exponentially over time, giving a smoother reduction than the abrupt drops of step decay.
Cosine Annealing:
Starts with a high learning rate and gradually reduces it following a cosine curve.
Can be used with a restart mechanism to allow the model to escape local minima.
Cyclical Learning Rates:
Alternates between high and low learning rates during training to allow for exploration of the loss surface.
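The schedules listed above can each be written as a function from the epoch number to a learning rate. The following is a sketch under assumed parameter names and default values (lr0, drop, every, k, period, half_cycle); the cyclical schedule uses a simple triangular shape, which is one common choice among several.

```python
import math


def step_decay(epoch, lr0=0.1, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return lr0 * drop ** (epoch // every)


def exponential_decay(epoch, lr0=0.1, k=0.05):
    """Smooth decay: lr0 * exp(-k * epoch)."""
    return lr0 * math.exp(-k * epoch)


def cosine_annealing(epoch, lr_max=0.1, lr_min=0.0, period=50):
    """Cosine curve from lr_max toward lr_min, restarting every `period` epochs."""
    t = epoch % period  # position within the current cycle (warm restart)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / period))


def triangular_cyclical(epoch, lr_min=0.001, lr_max=0.1, half_cycle=10):
    """Linear ramp from lr_min up to lr_max and back down, repeated."""
    pos = epoch % (2 * half_cycle)
    frac = pos / half_cycle if pos <= half_cycle else 2.0 - pos / half_cycle
    return lr_min + (lr_max - lr_min) * frac


for epoch in (0, 10, 25, 49, 50):
    print(epoch,
          step_decay(epoch),
          exponential_decay(epoch),
          cosine_annealing(epoch),
          triangular_cyclical(epoch))
```

In a training loop, the returned value would be assigned to the optimizer's learning rate at the start of each epoch. Setting period larger than the total number of epochs gives cosine annealing without restarts.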
Practical Session: Implementing Advanced Optimizers and Learning Rate Scheduling
Objective: Train a CNN model using different optimizers and learning rate scheduling techniques to observe the impact on performance and convergence speed.
Dataset: CIFAR-10 or Fashion-MNIST.
Key Steps:
Step 1: Build a Baseline CNN Model
Use a standard CNN architecture (e.g., 2 convolutional layers, followed by fully connected layers) to classify images.
Step 2: Train with Different Optimizers
Train the model using SGD, Adam, and RMSprop.
Compare the training and validation loss, accuracy, and convergence speed for each optimizer.
Step 3: Apply Momentum and Nesterov Accelerated Gradient (NAG)
Implement momentum-based SGD and Nesterov accelerated gradient for comparison.
Step 4: Experiment with Learning Rate Scheduling
Apply different learning rate schedulers (e.g., step decay, exponential decay, and cosine annealing).
Plot the learning rate over epochs and observe its impact on model training.
Step 5: Evaluate Model Performance
Evaluate the trained models on the test set and compare results in terms of accuracy, loss, and convergence speed.
Visualize the training process with learning curves for different optimizers and learning rates.
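Before running the full CNN experiment, the comparison logic in Steps 2 and 5 can be prototyped on a toy problem. The sketch below is a stand-in under stated assumptions: a two-parameter quadratic replaces the CNN loss, and all names (make_sgd, make_momentum, make_adam, run, steps_to_target) and hyperparameters are hypothetical. It records a loss curve per optimizer and reports how many steps each needs to reach a target loss, which is one simple way to quantify convergence speed.

```python
import math


def loss(w):
    # Assumed toy loss: steep along w[0], shallow along w[1].
    return 0.5 * (10.0 * w[0] ** 2 + 0.1 * w[1] ** 2)


def grad(w):
    return [10.0 * w[0], 0.1 * w[1]]


def make_sgd(lr=0.05):
    def step(w, g, state, t):
        return [wi - lr * gi for wi, gi in zip(w, g)]
    return step


def make_momentum(lr=0.05, mu=0.9):
    def step(w, g, state, t):
        v = state.setdefault("v", [0.0] * len(w))
        for i, gi in enumerate(g):
            v[i] = mu * v[i] - lr * gi
        return [wi + vi for wi, vi in zip(w, v)]
    return step


def make_adam(lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    def step(w, g, state, t):
        m = state.setdefault("m", [0.0] * len(w))
        v = state.setdefault("v", [0.0] * len(w))
        new_w = []
        for i, gi in enumerate(g):
            m[i] = b1 * m[i] + (1.0 - b1) * gi
            v[i] = b2 * v[i] + (1.0 - b2) * gi * gi
            m_hat = m[i] / (1.0 - b1 ** t)
            v_hat = v[i] / (1.0 - b2 ** t)
            new_w.append(w[i] - lr * m_hat / (math.sqrt(v_hat) + eps))
        return new_w
    return step


def run(step, steps=300, w0=(1.0, 1.0)):
    """Run one optimizer from w0 and return the loss after every step."""
    w, state, history = list(w0), {}, [loss(w0)]
    for t in range(1, steps + 1):
        w = step(w, grad(w), state, t)
        history.append(loss(w))
    return history


def steps_to_target(history, target):
    """Index of the first recorded loss below target, or None if never reached."""
    return next((i for i, value in enumerate(history) if value < target), None)


optimizers = {"sgd": make_sgd(), "momentum": make_momentum(), "adam": make_adam()}
results = {name: run(step) for name, step in optimizers.items()}
for name, history in results.items():
    print(name, "final loss:", history[-1],
          "steps to loss < 0.01:", steps_to_target(history, 0.01))
```

For the actual session, the history lists would hold per-epoch training and validation losses from the CNN, and the same steps_to_target helper could be applied to them. Rankings on this toy problem should not be expected to carry over to CIFAR-10 or Fashion-MNIST.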
Assignment for Week 5:
Coding Assignment:
Train a CNN model using the Adam optimizer and compare it with RMSprop and SGD with momentum.
Experiment with different learning rate schedulers (e.g., step decay, exponential decay, cosine annealing).
Visualize the impact of different optimizers and learning rate schedules on the training and validation performance.
Analysis:
Compare the performance and convergence speed of different optimizers.
Analyze the effect of learning rate scheduling on convergence and final model performance.
Reading Assignment:
Read Chapter 6 of "Advanced Applied Deep Learning" by Umberto Michelucci.
Focus on understanding how different optimizers work and when to use each in CNN training.
Summary of Key Concepts:
Optimizers: Methods for adjusting weights to minimize the loss function, including Adam, RMSprop, and SGD with momentum.
Learning Rate Scheduling: Techniques to adjust the learning rate during training to improve convergence and avoid local minima.
Practical application of advanced optimization techniques and learning rate scheduling in CNN training.
This week equips students with the necessary knowledge and skills to select and apply advanced optimization techniques and learning rate scheduling in deep learning. These skills are crucial for training efficient and high-performing CNN models.