Advanced Applied Deep Learning
Practice Course
Sheng Yun Wu
In Week 5, students will learn and implement advanced optimization techniques that are critical for improving the performance and training efficiency of deep learning models. This includes optimizers like Adam, RMSprop, and SGD with momentum, as well as techniques like learning rate schedules and gradient clipping. These methods are essential for achieving better convergence and faster training, especially in deep neural networks.
Description:
This example introduces the basic SGD optimizer and demonstrates how to use it to train a simple model.
import tensorflow as tf
from tensorflow.keras import models, layers
from tensorflow.keras.optimizers import SGD
# Load MNIST dataset
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
# Reshape and normalize the data
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255
# Display shape of the dataset
print(f"Train images shape: {train_images.shape}")
print(f"Test images shape: {test_images.shape}")
# Build a simple model
model = models.Sequential([
    layers.Flatten(input_shape=(28, 28, 1)),
    layers.Dense(512, activation='relu'),
    layers.Dense(10, activation='softmax')
])
# Compile the model using SGD optimizer
model.compile(optimizer=SGD(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(train_images, train_labels, epochs=5, validation_data=(test_images, test_labels))
Output
Train images shape: (60000, 28, 28, 1)
Test images shape: (10000, 28, 28, 1)
Epoch 1/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.6017 - accuracy: 0.8540 - val_loss: 0.3415 - val_accuracy: 0.9110
Epoch 2/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3221 - accuracy: 0.9112 - val_loss: 0.2767 - val_accuracy: 0.9245
Epoch 3/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.2747 - accuracy: 0.9228 - val_loss: 0.2480 - val_accuracy: 0.9311
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2441 - accuracy: 0.9316 - val_loss: 0.2236 - val_accuracy: 0.9379
Epoch 5/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2211 - accuracy: 0.9386 - val_loss: 0.2063 - val_accuracy: 0.9426
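To make the update behind the SGD optimizer concrete, the following is a minimal pure-Python sketch of the rule w <- w - lr * g. The names `sgd_step` and `grad_f`, the toy loss f(w) = sum(w_i^2), and the learning rate of 0.1 are assumptions chosen for illustration; they are not part of Keras or of the MNIST example above.

```python
def sgd_step(weights, grads, lr=0.01):
    """One plain SGD update: w <- w - lr * g for every weight."""
    return [w - lr * g for w, g in zip(weights, grads)]


def grad_f(weights):
    """Gradient of the toy loss f(w) = sum(w_i ** 2) (illustrative assumption)."""
    return [2.0 * w for w in weights]


weights = [1.0, -2.0]
for _ in range(100):
    weights = sgd_step(weights, grad_f(weights), lr=0.1)

print(weights)  # both entries shrink toward the minimum at 0
```

On this toy loss every step multiplies each weight by 0.8, which is why the learning rate alone decides how fast plain SGD converges.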
Description:
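Momentum accelerates SGD along directions where the gradient is consistent and dampens oscillations. This example shows how to implement SGD with momentum. Note that re-compiling a model does not reset its weights, so training here (and in the following examples) continues from the weights learned in the previous example; rebuild the model first if you want each optimizer to start from scratch.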
# Compile the model using SGD with momentum
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(train_images, train_labels, epochs=5, validation_data=(test_images, test_labels))
Output
Epoch 1/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1757 - accuracy: 0.9496 - val_loss: 0.1453 - val_accuracy: 0.9569
Epoch 2/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1086 - accuracy: 0.9685 - val_loss: 0.1023 - val_accuracy: 0.9687
Epoch 3/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0790 - accuracy: 0.9776 - val_loss: 0.0845 - val_accuracy: 0.9750
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0618 - accuracy: 0.9822 - val_loss: 0.0739 - val_accuracy: 0.9766
Epoch 5/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0495 - accuracy: 0.9858 - val_loss: 0.0730 - val_accuracy: 0.9772
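The momentum update can be sketched in a few lines of plain Python. The form below (velocity = momentum * velocity - lr * g, then w = w + velocity) follows the update rule given in the Keras SGD documentation; the function name `momentum_step` and the toy loss f(w) = w^2 are illustrative assumptions.

```python
def momentum_step(w, velocity, grad, lr=0.01, momentum=0.9):
    """SGD with momentum: the velocity accumulates past gradients."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity


# Toy loss f(w) = w ** 2 with gradient 2 * w (illustrative assumption).
w_plain, w_mom, velocity = 1.0, 1.0, 0.0
for _ in range(200):
    w_plain = w_plain - 0.01 * (2.0 * w_plain)                     # plain SGD
    w_mom, velocity = momentum_step(w_mom, velocity, 2.0 * w_mom)  # with momentum

print(abs(w_plain), abs(w_mom))  # the momentum run ends closer to the minimum at 0
```

With the same learning rate of 0.01, the accumulated velocity lets the momentum run cover more ground per step on this toy problem.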
Description:
RMSprop adapts the step size for each parameter by dividing the gradient by a running average of its recent magnitudes. It is especially useful when gradient scales differ widely between parameters or change over the course of training.
from tensorflow.keras.optimizers import RMSprop
# Compile the model using RMSprop optimizer
model.compile(optimizer=RMSprop(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(train_images, train_labels, epochs=5, validation_data=(test_images, test_labels))
Output
Epoch 1/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.0786 - accuracy: 0.9750 - val_loss: 0.0950 - val_accuracy: 0.9726
Epoch 2/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.0606 - accuracy: 0.9822 - val_loss: 0.0962 - val_accuracy: 0.9750
Epoch 3/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0484 - accuracy: 0.9868 - val_loss: 0.0872 - val_accuracy: 0.9783
Epoch 4/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0384 - accuracy: 0.9894 - val_loss: 0.0915 - val_accuracy: 0.9793
Epoch 5/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0328 - accuracy: 0.9909 - val_loss: 0.0963 - val_accuracy: 0.9801
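A minimal sketch of the RMSprop update shows what "per-parameter" means in practice. The defaults below (learning rate 0.001, rho 0.9, epsilon 1e-7) are meant to mirror the Keras defaults as I understand them; the function name `rmsprop_step` and the example gradients are assumptions, and implementations differ on exactly where epsilon is added.

```python
import math


def rmsprop_step(weights, avg_sq, grads, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSprop update; avg_sq holds a running average of squared gradients."""
    new_weights, new_avg_sq = [], []
    for w, s, g in zip(weights, avg_sq, grads):
        s = rho * s + (1.0 - rho) * g * g
        new_avg_sq.append(s)
        new_weights.append(w - lr * g / (math.sqrt(s) + eps))
    return new_weights, new_avg_sq


# Two parameters whose gradients differ by four orders of magnitude.
start = [0.0, 0.0]
updated, avg_sq = rmsprop_step(start, [0.0, 0.0], [100.0, 0.01])
step_sizes = [abs(a - b) for a, b in zip(updated, start)]
print(step_sizes)  # the two step sizes are nearly identical
```

Because each gradient is divided by its own running magnitude, both parameters move by almost the same amount even though their raw gradients are very different.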
Description:
Adam is one of the most popular optimization algorithms. This example demonstrates how to use Adam, which combines the advantages of both RMSprop and SGD with momentum.
from tensorflow.keras.optimizers import Adam
# Compile the model using Adam optimizer
model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(train_images, train_labels, epochs=5, validation_data=(test_images, test_labels))
Output
Epoch 1/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0396 - accuracy: 0.9877 - val_loss: 0.0946 - val_accuracy: 0.9765
Epoch 2/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0305 - accuracy: 0.9899 - val_loss: 0.1016 - val_accuracy: 0.9762
Epoch 3/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0239 - accuracy: 0.9926 - val_loss: 0.1128 - val_accuracy: 0.9735
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0201 - accuracy: 0.9933 - val_loss: 0.1002 - val_accuracy: 0.9781
Epoch 5/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0181 - accuracy: 0.9944 - val_loss: 0.1132 - val_accuracy: 0.9777
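The way Adam combines a momentum-style average with RMSprop-style scaling can be sketched for a single scalar weight. This follows the standard Adam update with bias correction; the defaults are intended to match the Keras defaults as I understand them, and the function name `adam_step` and the toy gradient are illustrative assumptions.

```python
import math


def adam_step(w, m, v, grad, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update for a scalar weight; t is the 1-based step count."""
    m = beta_1 * m + (1.0 - beta_1) * grad         # momentum-like average of gradients
    v = beta_2 * v + (1.0 - beta_2) * grad * grad  # RMSprop-like average of squared gradients
    m_hat = m / (1.0 - beta_1 ** t)                # bias correction for the zero start
    v_hat = v / (1.0 - beta_2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# First step on the toy loss f(w) = w ** 2 at w = 1.0 (gradient 2.0).
w, m, v = adam_step(1.0, 0.0, 0.0, grad=2.0, t=1)
print(w)  # close to 1.0 - lr, because m_hat / sqrt(v_hat) is about 1 on the first step
```

On the first step the update size is roughly the learning rate regardless of the gradient's scale, which is one reason Adam tends to be less sensitive to how the loss is scaled.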
Description:
Learning rate scheduling dynamically adjusts the learning rate during training to speed up convergence. This example introduces a simple learning rate schedule.
from tensorflow.keras.callbacks import LearningRateScheduler
# Define a learning rate schedule
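# Keras passes a 0-based epoch index, so the reduced rate first applies to the 12th epoch (index 11)
def lr_schedule(epoch):
    lr = 0.001
    if epoch > 10:
        lr *= 0.1
    return lr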
# Add learning rate scheduler callback
lr_scheduler = LearningRateScheduler(lr_schedule)
# Compile and train the model with learning rate scheduler
model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=15, validation_data=(test_images, test_labels), callbacks=[lr_scheduler])
Output
Epoch 1/15
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0194 - accuracy: 0.9939 - val_loss: 0.1149 - val_accuracy: 0.9776
Epoch 2/15
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0150 - accuracy: 0.9952 - val_loss: 0.1177 - val_accuracy: 0.9779
Epoch 3/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0154 - accuracy: 0.9950 - val_loss: 0.0980 - val_accuracy: 0.9812
Epoch 4/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0120 - accuracy: 0.9963 - val_loss: 0.1155 - val_accuracy: 0.9795
Epoch 5/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0127 - accuracy: 0.9961 - val_loss: 0.1160 - val_accuracy: 0.9807
Epoch 6/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0106 - accuracy: 0.9963 - val_loss: 0.1314 - val_accuracy: 0.9790
Epoch 7/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0109 - accuracy: 0.9967 - val_loss: 0.1296 - val_accuracy: 0.9801
Epoch 8/15
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0099 - accuracy: 0.9970 - val_loss: 0.1162 - val_accuracy: 0.9820
Epoch 9/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0093 - accuracy: 0.9973 - val_loss: 0.1269 - val_accuracy: 0.9802
Epoch 10/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0102 - accuracy: 0.9969 - val_loss: 0.1335 - val_accuracy: 0.9817
Epoch 11/15
1875/1875 [==============================] - 4s 2ms/step - loss: 0.0082 - accuracy: 0.9974 - val_loss: 0.1278 - val_accuracy: 0.9815
Epoch 12/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0019 - accuracy: 0.9995 - val_loss: 0.1137 - val_accuracy: 0.9843
Epoch 13/15
1875/1875 [==============================] - 3s 2ms/step - loss: 3.0625e-04 - accuracy: 0.9999 - val_loss: 0.1115 - val_accuracy: 0.9849
Epoch 14/15
1875/1875 [==============================] - 3s 2ms/step - loss: 1.0444e-04 - accuracy: 1.0000 - val_loss: 0.1120 - val_accuracy: 0.9846
Epoch 15/15
1875/1875 [==============================] - 3s 2ms/step - loss: 6.3511e-05 - accuracy: 1.0000 - val_loss: 0.1113 - val_accuracy: 0.9847
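The step schedule used in this example can be checked without training anything. The sketch below evaluates the same rule for 15 epochs, assuming the 0-based epoch index that Keras passes to `LearningRateScheduler`; the parameter names `base_lr`, `drop_epoch`, and `factor` are introduced here for illustration.

```python
def lr_schedule(epoch, base_lr=0.001, drop_epoch=10, factor=0.1):
    """Step decay: use base_lr, then base_lr * factor once epoch > drop_epoch."""
    return base_lr * factor if epoch > drop_epoch else base_lr


rates = [lr_schedule(epoch) for epoch in range(15)]  # epoch indices 0..14
first_reduced = next(e for e, r in enumerate(rates) if r < 0.001)

for epoch, rate in enumerate(rates):
    print(f"epoch index {epoch:2d} (shown as Epoch {epoch + 1}/15): lr = {rate:g}")
```

The first reduced rate appears at epoch index 11, which the training log displays as Epoch 12/15; that matches the point where the training loss above drops sharply.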
Description:
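This example uses an exponentially decaying learning rate to gradually reduce the learning rate during training. The rate at a given optimizer step is initial_learning_rate * decay_rate ^ (step / decay_steps). With decay_steps=100000 and only 15 epochs of 1875 steps each (28,125 steps in total), the rate falls by only about 1%, so a smaller decay_steps is needed to see a noticeable effect.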
from tensorflow.keras.optimizers.schedules import ExponentialDecay
# Define an exponential decay learning rate schedule
lr_schedule = ExponentialDecay(initial_learning_rate=0.001, decay_steps=100000, decay_rate=0.96)
# Compile the model with the learning rate schedule
model.compile(optimizer=Adam(learning_rate=lr_schedule), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(train_images, train_labels, epochs=15, validation_data=(test_images, test_labels))
Output
Epoch 1/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0105 - accuracy: 0.9969 - val_loss: 0.1389 - val_accuracy: 0.9817
Epoch 2/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0070 - accuracy: 0.9977 - val_loss: 0.1568 - val_accuracy: 0.9807
Epoch 3/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0068 - accuracy: 0.9982 - val_loss: 0.1399 - val_accuracy: 0.9820
Epoch 4/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0063 - accuracy: 0.9983 - val_loss: 0.1439 - val_accuracy: 0.9818
Epoch 5/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0078 - accuracy: 0.9980 - val_loss: 0.1619 - val_accuracy: 0.9791
Epoch 6/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0063 - accuracy: 0.9980 - val_loss: 0.1463 - val_accuracy: 0.9813
Epoch 7/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0066 - accuracy: 0.9979 - val_loss: 0.1545 - val_accuracy: 0.9805
Epoch 8/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0058 - accuracy: 0.9984 - val_loss: 0.1455 - val_accuracy: 0.9837
Epoch 9/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0071 - accuracy: 0.9984 - val_loss: 0.1642 - val_accuracy: 0.9806
Epoch 10/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0082 - accuracy: 0.9980 - val_loss: 0.1638 - val_accuracy: 0.9810
Epoch 11/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0036 - accuracy: 0.9990 - val_loss: 0.1511 - val_accuracy: 0.9832
Epoch 12/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0073 - accuracy: 0.9983 - val_loss: 0.1586 - val_accuracy: 0.9812
Epoch 13/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0043 - accuracy: 0.9987 - val_loss: 0.2114 - val_accuracy: 0.9786
Epoch 14/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0075 - accuracy: 0.9978 - val_loss: 0.1744 - val_accuracy: 0.9811
Epoch 15/15
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0053 - accuracy: 0.9985 - val_loss: 0.1643 - val_accuracy: 0.9828
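The effect of the chosen `decay_steps` is easy to check by evaluating the decay formula directly. The sketch below assumes the non-staircase form initial_lr * decay_rate ** (step / decay_steps) documented for `ExponentialDecay`, and 1875 optimizer steps per epoch as shown in the training log; the function name `exponential_decay` is illustrative.

```python
def exponential_decay(step, initial_lr=0.001, decay_steps=100000, decay_rate=0.96):
    """Learning rate after `step` optimizer steps (continuous, non-staircase form)."""
    return initial_lr * decay_rate ** (step / decay_steps)


steps_per_epoch = 1875            # 60000 training images / default batch size of 32
total_steps = steps_per_epoch * 15

lr_as_configured = exponential_decay(total_steps)                 # decay_steps=100000
lr_per_epoch = exponential_decay(total_steps, decay_steps=1875)   # decay once per epoch

print(f"{lr_as_configured:.6f} {lr_per_epoch:.6f}")
```

Under these assumptions the rate after 15 epochs is still about 0.00099 with the configured schedule, whereas decaying by 0.96 once per epoch would bring it down to roughly 0.00054.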
Description:
Gradient clipping helps prevent the exploding gradient problem by limiting the magnitude of the gradients. This example applies gradient clipping to the Adam optimizer: clipvalue=1.0 clips every gradient component to the range [-1, 1] before the update is computed.
# Compile the model with gradient clipping
model.compile(optimizer=Adam(clipvalue=1.0), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(train_images, train_labels, epochs=5, validation_data=(test_images, test_labels))
Output
Epoch 1/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0060 - accuracy: 0.9983 - val_loss: 0.2222 - val_accuracy: 0.9775
Epoch 2/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0080 - accuracy: 0.9980 - val_loss: 0.2305 - val_accuracy: 0.9761
Epoch 3/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0045 - accuracy: 0.9989 - val_loss: 0.1727 - val_accuracy: 0.9822
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0065 - accuracy: 0.9983 - val_loss: 0.1771 - val_accuracy: 0.9827
Epoch 5/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0056 - accuracy: 0.9986 - val_loss: 0.1949 - val_accuracy: 0.9809
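Clipping by value and clipping by norm behave differently, which a small pure-Python sketch makes visible. `clip_by_value` mimics what `clipvalue` does to each gradient component; `clip_by_norm` mimics the idea behind the `clipnorm` argument that Keras optimizers also accept. Both helper names and the example gradient are assumptions for illustration, not Keras functions.

```python
import math


def clip_by_value(grad, clip=1.0):
    """Clamp every component to [-clip, clip]; the direction may change."""
    return [max(-clip, min(clip, g)) for g in grad]


def clip_by_norm(grad, max_norm=1.0):
    """Rescale the whole vector if its L2 norm exceeds max_norm; direction is kept."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


grad = [3.0, -4.0, 0.5]
by_value = clip_by_value(grad)
by_norm = clip_by_norm(grad)
print(by_value)
print(by_norm)
```

Value clipping flattens the two large components to the same magnitude, while norm clipping shrinks all components by one common factor and so preserves their ratios.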
Description:
This example introduces cyclical learning rates (CLR), where the learning rate fluctuates between a lower and upper bound throughout training, often leading to faster convergence.
from tensorflow.keras.callbacks import Callback
import numpy as np
# Define a cyclical learning rate callback
class CyclicLR(Callback):
    def __init__(self, base_lr=0.001, max_lr=0.006, step_size=2000., mode='triangular'):
        super().__init__()
        self.base_lr = base_lr
        self.max_lr = max_lr
        self.step_size = step_size  # number of batches in half a cycle
        self.mode = mode
        self.lr = base_lr

    def on_batch_end(self, batch, logs=None):
        # NOTE: `batch` is the index within the current epoch and restarts at 0
        # every epoch, so with 1875 batches per epoch and step_size=2000 the rate
        # ramps up from base_lr each epoch without completing a full cycle.
        cycle = np.floor(1 + batch / (2 * self.step_size))
        x = np.abs(batch / self.step_size - 2 * cycle + 1)
        if self.mode == 'triangular':
            self.lr = self.base_lr + (self.max_lr - self.base_lr) * np.maximum(0, (1 - x))
        # Apply the new learning rate
        tf.keras.backend.set_value(self.model.optimizer.lr, self.lr)
# Compile the model with Adam optimizer
model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model with Cyclical Learning Rate
clr = CyclicLR(base_lr=0.001, max_lr=0.006, step_size=2000)
model.fit(train_images, train_labels, epochs=5, validation_data=(test_images, test_labels), callbacks=[clr])
Output
Epoch 1/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1130 - accuracy: 0.9845 - val_loss: 0.4001 - val_accuracy: 0.9611
Epoch 2/5
1875/1875 [==============================] - 4s 2ms/step - loss: 0.1063 - accuracy: 0.9848 - val_loss: 0.4311 - val_accuracy: 0.9635
Epoch 3/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1070 - accuracy: 0.9848 - val_loss: 0.4621 - val_accuracy: 0.9606
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0941 - accuracy: 0.9862 - val_loss: 0.4423 - val_accuracy: 0.9612
Epoch 5/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0896 - accuracy: 0.9867 - val_loss: 0.4552 - val_accuracy: 0.9603
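The triangular policy is easiest to understand as a function of a global iteration count that keeps increasing across epochs. The sketch below uses the same formula as the callback in this example, written with the standard library only; the function name `triangular_lr` is an assumption, and the per-epoch batch index 1874 is taken from the 1875 steps per epoch in the log.

```python
import math


def triangular_lr(iteration, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Triangular cyclical learning rate as a function of the global iteration."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# One full cycle takes 2 * step_size = 4000 iterations.
for iteration in (0, 1000, 2000, 3000, 4000):
    print(iteration, round(triangular_lr(iteration), 6))

# Highest rate reachable if the counter restarts every epoch (last index 1874).
peak_with_epoch_index = triangular_lr(1874)
```

When the counter runs over the whole of training, the rate climbs to `max_lr` at iteration 2000 and returns to `base_lr` at iteration 4000. If the counter restarts every epoch, as a per-epoch batch index does, the rate never reaches `max_lr` with these settings, so a callback would need to keep its own running iteration count to follow the full cycle.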
Description:
This example shows how to use Nesterov Accelerated Gradient (NAG), a variant of SGD with momentum that evaluates the gradient at the look-ahead position reached by following the current momentum, rather than at the current weights.
# Compile the model using Nesterov Accelerated Gradient (NAG)
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9, nesterov=True), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(train_images, train_labels, epochs=5, validation_data=(test_images, test_labels))
Output
Epoch 1/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0440 - accuracy: 0.9925 - val_loss: 0.1835 - val_accuracy: 0.9806
Epoch 2/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0101 - accuracy: 0.9977 - val_loss: 0.1773 - val_accuracy: 0.9815
Epoch 3/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0037 - accuracy: 0.9988 - val_loss: 0.1759 - val_accuracy: 0.9810
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0015 - accuracy: 0.9996 - val_loss: 0.1718 - val_accuracy: 0.9826
Epoch 5/5
1875/1875 [==============================] - 3s 2ms/step - loss: 7.6342e-04 - accuracy: 0.9998 - val_loss: 0.1711 - val_accuracy: 0.9825
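The difference between Nesterov and classical momentum is a single extra term in the weight update. The sketch below follows the form given in the Keras SGD documentation for `nesterov=True` (velocity = momentum * velocity - lr * g, then w = w + momentum * velocity - lr * g); the function name `nesterov_step` and the toy loss f(w) = w^2 are illustrative assumptions.

```python
def nesterov_step(w, velocity, grad, lr=0.01, momentum=0.9):
    """One Nesterov momentum update for a scalar weight."""
    velocity = momentum * velocity - lr * grad
    w = w + momentum * velocity - lr * grad  # look-ahead: apply the velocity once more
    return w, velocity


# First step on f(w) = w ** 2 at w = 1.0 (gradient 2.0).
first_w, first_v = nesterov_step(1.0, 0.0, 2.0)
print(first_w, first_v)

# Run 200 steps from the same start.
w, velocity = 1.0, 0.0
for _ in range(200):
    w, velocity = nesterov_step(w, velocity, 2.0 * w)
print(abs(w))  # very close to the minimum at 0
```

Compared with classical momentum, the weight moves by the updated velocity scaled by the momentum plus a fresh gradient step, which is how this formulation expresses the look-ahead.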
Description:
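This example trains with different optimizers (SGD, Adam, RMSprop) in turn to help students understand which optimizer works best for different scenarios. Because the loop re-compiles the same model without rebuilding it, each optimizer starts from the weights left by the previous run; for a like-for-like comparison, create a fresh model for each optimizer.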
# Compile models with different optimizers
optimizers = {'SGD': SGD(), 'Adam': Adam(), 'RMSprop': RMSprop()}
# Train each model and compare performance
for opt_name, opt in optimizers.items():
    print(f"Training with {opt_name} optimizer:")
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.fit(train_images, train_labels, epochs=5, validation_data=(test_images, test_labels))
Output
Training with SGD optimizer:
Epoch 1/5
1875/1875 [==============================] - 3s 2ms/step - loss: 3.8336e-04 - accuracy: 1.0000 - val_loss: 0.1712 - val_accuracy: 0.9827
Epoch 2/5
1875/1875 [==============================] - 3s 2ms/step - loss: 3.6744e-04 - accuracy: 1.0000 - val_loss: 0.1713 - val_accuracy: 0.9828
Epoch 3/5
1875/1875 [==============================] - 3s 2ms/step - loss: 3.5512e-04 - accuracy: 1.0000 - val_loss: 0.1714 - val_accuracy: 0.9828
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 3.4474e-04 - accuracy: 1.0000 - val_loss: 0.1715 - val_accuracy: 0.9829
Epoch 5/5
1875/1875 [==============================] - 3s 2ms/step - loss: 3.3602e-04 - accuracy: 1.0000 - val_loss: 0.1716 - val_accuracy: 0.9829
Training with Adam optimizer:
Epoch 1/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0077 - accuracy: 0.9980 - val_loss: 0.1870 - val_accuracy: 0.9839
Epoch 2/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0048 - accuracy: 0.9988 - val_loss: 0.1802 - val_accuracy: 0.9836
Epoch 3/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0051 - accuracy: 0.9988 - val_loss: 0.2021 - val_accuracy: 0.9818
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0031 - accuracy: 0.9992 - val_loss: 0.2007 - val_accuracy: 0.9836
Epoch 5/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0026 - accuracy: 0.9995 - val_loss: 0.1965 - val_accuracy: 0.9824
Training with RMSprop optimizer:
Epoch 1/5
1875/1875 [==============================] - 7s 4ms/step - loss: 4.6306e-04 - accuracy: 0.9999 - val_loss: 0.1878 - val_accuracy: 0.9841
Epoch 2/5
1875/1875 [==============================] - 6s 3ms/step - loss: 2.3459e-05 - accuracy: 1.0000 - val_loss: 0.1901 - val_accuracy: 0.9846
Epoch 3/5
1875/1875 [==============================] - 6s 3ms/step - loss: 3.3653e-06 - accuracy: 1.0000 - val_loss: 0.1977 - val_accuracy: 0.9845
Epoch 4/5
1875/1875 [==============================] - 6s 3ms/step - loss: 6.7995e-07 - accuracy: 1.0000 - val_loss: 0.1984 - val_accuracy: 0.9850
Epoch 5/5
1875/1875 [==============================] - 6s 3ms/step - loss: 7.0132e-08 - accuracy: 1.0000 - val_loss: 0.1982 - val_accuracy: 0.9851
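A comparison between optimizers is only meaningful when every run starts from the same point with fresh optimizer state. The sketch below illustrates that pattern on a toy problem rather than on MNIST: each update rule minimizes f(w) = w^2 from w = 1.0 with its own empty state. The helper names (`run`, `sgd`, `momentum`, `adam`) and the hyperparameters are assumptions for illustration and stand in for rebuilding the Keras model before each `compile`.

```python
import math


def run(update, steps=200, w0=1.0):
    """Minimise f(w) = w ** 2 from a fresh start; returns the final loss."""
    w, state = w0, {}                     # new weights and new optimizer state
    for t in range(1, steps + 1):
        w = update(w, 2.0 * w, state, t)  # gradient of w ** 2 is 2 * w
    return w * w


def sgd(w, g, state, t, lr=0.01):
    return w - lr * g


def momentum(w, g, state, t, lr=0.01, mu=0.9):
    state["v"] = mu * state.get("v", 0.0) - lr * g
    return w + state["v"]


def adam(w, g, state, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-7):
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g * g
    m_hat = state["m"] / (1 - b1 ** t)
    v_hat = state["v"] / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


results = {name: run(fn) for name, fn in
           [("SGD", sgd), ("SGD+momentum", momentum), ("Adam", adam)]}
for name, loss in results.items():
    print(f"{name}: final loss {loss:.3g}")
```

In this sketch Adam makes the least progress only because its learning rate of 0.001 is ten times smaller than the 0.01 used for the SGD variants. Rankings like this depend on the learning rates chosen, so they say little about which optimizer is better in general.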
Objective: Learn how to optimize training through the use of different optimizers, learning rate schedules, and techniques like gradient clipping and cyclical learning rates.
Skills Developed:
Understand and implement advanced optimizers (SGD with momentum, RMSprop, Adam).
Learn about learning rate schedules, such as step decay, exponential decay, and cyclical learning rates.
Apply gradient clipping to prevent exploding gradients.
Tools: TensorFlow, Keras, Optimizers (SGD, RMSprop, Adam).
These 10 examples in Week 5 provide students with hands-on experience in using advanced optimization techniques to improve model training. By experimenting with different optimizers, learning rates, and schedules, students can observe how each method influences the convergence speed and performance of their models.