Trang chủ‎ > ‎IT‎ > ‎Data Science - Python‎ > ‎Tensorflow‎ > ‎

weight decay, learning rate decay implementation

>> You can add all the variables you want to add weight decay to, to a collection name 'variables' and then you calculate the L2 norm weight decay for the whole collection.

  # Create your variables
  weights = tf.get_variable('weights', collections=['variables'])

  with tf.variable_scope('weights_norm') as scope:
    weights_norm = tf.reduce_sum(
      input_tensor = WEIGHT_DECAY_FACTOR*tf.pack(
          [tf.nn.l2_loss(i) for i in tf.get_collection('weights')]
      ),
      name='weights_norm'
  )

  # Add the weight decay loss to another collection called losses
  tf.add_to_collection('losses', weights_norm)

  # Add the other loss components to the collection losses     
  # ...

  # To calculate your total loss
  tf.add_n(tf.get_collection('losses'), name='total_loss')

>> Looking at the cifar10 example and I noticed the function _variable_with_weight_decay(...) in the file cifar10.py. The code is as follows:

def _variable_with_weight_decay(name, shape, stddev, wd):
  """Helper to create an initialized Variable with weight decay.
  Note that the Variable is initialized with a truncated normal distribution.
  A weight decay is added only if one is specified.
  Args:
    name: name of the variable
    shape: list of ints
    stddev: standard deviation of a truncated Gaussian
    wd: add L2Loss weight decay multiplied by this float. If None, weight
        decay is not added for this Variable.
  Returns:
    Variable Tensor
  """
  dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
  var = _variable_on_cpu(
      name,
      shape,
      tf.truncated_normal_initializer(stddev=stddev, dtype=dtype))
  if wd is not None:
    weight_decay = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss')
    tf.add_to_collection('losses', weight_decay)
  return var

The optimizer is defined as follows

learning_rate =  0.2
decay_rate = 0.95            
self.learning_rate_node = tf.train.exponential_decay(learning_rate=learning_rate, 
                                                    global_step=global_step, 
                                                    decay_steps=training_iters,  
                                                    decay_rate=decay_rate, 
                                                    staircase=True)

optimizer = tf.train.MomentumOptimizer(learning_rate=self.learning_rate_node).minimize(
                                                self.net.cost, global_step=global_step)

However if you need some insights on how the momentum optimizer works and how the learning rate should decay.

  1. First the Vanilla GradientDescentMinimizer's update which is the most basic:

    W^(n+1)=W^(n)-alpha*(gradient of cost wrt W)(W^n)

    You are just following the opposite of the gradient.

  2. The GradientDescentMinimizer with learning rate decay:

    W^(n+1)=W^(n)-alpha(n)*(gradient of the cost wrt W)(W^n)

The only thing that changed is the learning rate alpha , which is now dependent of the step in Tensorflow the most used is the exponential decay where after N step the learning rate is divided by some constant i.e. 10.

This change often happens later in the training so you might need to let a few epochs pass by before seeing it decay.

  1. The Momentumoptimizer: you have to keep an additional variable: the update you have done just before i.e you have to store at each time step:

    update^(n)=(W^(n)-W^(n-1))

    Then the corrected update by momentum is:

    update^(n+1)=mupdate^(n)-alpha(gradient of cost wrt W)(W^n)

So what you are doing is doing simple gradient descent but correcting it by remembering the immediate past (There are smarter and more complicated ways of doing it like Nesterov's momentum)

  1. MomentumOptimizer with learning rate decay:

update^(n)=(W^(n)-W^(n-1))

update^(n+1)=mupdate^(n)-alpha(n)(gradient of cost wrt W)(W^n)

alpha is now dependent of n too.

So at one point it will starts slowing down as in gradient descent with learning rate decay but the decrease will be affected by the momentum.

For a complete review of those methods and more you have the excellent website which explains far better than me and Alec Radford's famous visualization which is better than a thousand words.

The learning rate should not depend on the performance unless it is specified in the decay !
It would help to see the code in question !

EDIT1:: Here is a working example that I think answer both questions you asked:

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

#Pure SGD
BATCH_SIZE=1
#Batch Gradient Descent
#BATCH_SIZE=1000
starter_learning_rate=0.001

xdata=np.linspace(0.,2*np.pi,1000)[:,np.newaxis]
ydata=np.sin(xdata)+np.random.normal(0.0,0.05,size=1000)[:,np.newaxis]
plt.scatter(xdata,ydata)

x=tf.placeholder(tf.float32,[None,1])
y=tf.placeholder(tf.float32, [None,1])
#We define global_step as a variable initialized at 0
global_step=tf.Variable(0,trainable=False)

w1=tf.Variable(0.05*tf.random_normal((1,100)),tf.float32)
w2=tf.Variable(0.05*tf.random_normal((100,1)),tf.float32)

b1=tf.Variable(np.zeros([100]).astype("float32"),tf.float32)
b2=tf.Variable(np.zeros([1]).astype("float32"),tf.float32)

h1=tf.nn.relu(tf.matmul(x,w1)+b1)
y_model=tf.matmul(h1,w2)+b2


L=tf.reduce_mean(tf.square(y_model-y))

#We want to decrease the learning rate after having seen all the data 5 times
NUM_EPOCHS_PER_DECAY=5
LEARNING_RATE_DECAY_FACTOR=0.1
#Since the mechanism of the decay depends on the number of iterations and not epochs we have to connect the number of epochs to the number of iterations

#So if we have batch_size=1 we have to iterate exactly 1000 times to do one epoch so 5*1000=5000 iterations before decaying if the batch_size was 1000 1 iterations=1epoch and thus we decrease it after 5 iterations 
num_batches_per_epoch=int(xdata.shape[0]/float(BATCH_SIZE))
decay_steps=int(num_batches_per_epoch*NUM_EPOCHS_PER_DECAY)
decayed_learning_rate=tf.train.exponential_decay(starter_learning_rate, 
                                                     global_step, 
                                                     decay_steps, 
                                                     LEARNING_RATE_DECAY_FACTOR,
                                                     staircase=True)

#So now we have an object that depends on global_step and that will be divided by 10 every decay_steps iterations i.e. when global_step=N*decay_steps with N a non zero integer         

#We now create a train_step to which we pass the learning rate created each time this function is called global_step will be incremented by 1 we are gonna check that it is the case BE CAREFUL WE HAVE TO GIVE IT GLOBAL_STEP AS AN ARGUMENT
train_step=tf.train.GradientDescentOptimizer(decayed_learning_rate).minimize(L,global_step=global_step)
sess=tf.Session()
sess.run(tf.initialize_all_variables())
GLOBAL_s=[]
lr_val=[]
COSTS=[]
for i in range(16000):
    #We will do 1600 iterations so as there is a decay every 5000 iterations we will see 3 decays (5000,10000,15000)
    start_data=(i*BATCH_SIZE)%1000
    COSTS.append([sess.run(L, feed_dict={x:xdata,y:ydata})])
    GLOBAL_s.append([sess.run(global_step)])
    lr_val.append([sess.run(decayed_learning_rate)])
    #I see the train_step as implicitely executing sess.run(tf.add(global_step,1))
    sess.run(train_step,feed_dict={x:xdata[start_data:start_data+BATCH_SIZE],y:ydata[start_data:start_data+BATCH_SIZE]})


plt.figure()
plt.subplot(211)
plt.plot(GLOBAL_s,lr_val,"-b")
plt.title("Evolution of learning rate" )
plt.subplot(212)
plt.plot(GLOBAL_s,COSTS,".g")
plt.title("Evolution of cost" )



#notice two things first global_step is actually being incremented and learning rate is actually being decayed

(You can writeMomentumOptimize() instead of GradientDescentOptimizer() obviously...)

Here are the two plots I get: Here is the evolution of the global_step parameter and decayed_learning_rateTo sum it up in my mind when you call train_step tensorflow runs tf.add(global_step,1)




Comments