By Jonathan Godwin, University College London.
Introduction

Our brains learn to do multiple different tasks at the same time: we have the same brain architecture whether we are translating English to German or English to French. If we were to use a Machine Learning algorithm to do both of these tasks, we might call that 'multitask' learning. It's one of the most interesting and exciting areas of research for Machine Learning in the coming years, radically reducing the amount of data required to learn new concepts. One of the great promises of Deep Learning is that, with the power of the models and simple ways to share parameters between tasks, we should be able to make significant progress in multitask learning.

As I started to experiment in this area I came across a bit of a road block: while it was easy to understand the architecture changes required to implement multitask learning, it was harder to figure out how to implement it in TensorFlow. To do anything but standard nets in TensorFlow requires a good understanding of how it works, but most of the stock examples don't provide helpful guidance. I hope the following tutorial explains some key concepts simply, and helps those who are struggling.

What We Are Going To Do
Understanding Computation Graphs With A Toy Example

There are some neat features of a graph that mean it's very easy to conduct multitask learning, but first we'll keep things simple and explain the key concepts.

Definition: Computation Graph

The Computation Graph is a template for the computation (i.e. the algorithm) you are going to run. It doesn't perform any calculations itself, but it means that your computer can conduct backpropagation far more quickly. If you ask TensorFlow for the result of a calculation, it will only make those calculations required for the job, not the whole graph.

A Toy Example - Linear Transformation: Setting Up The Graph

    # Import tensorflow and numpy
    import tensorflow as tf
    import numpy as np

    # ======================
    # Define the Graph
    # ======================

    # Create Placeholders For X And Y (for feeding in data)
    X = tf.placeholder("float", [10, 10], name="X")  # Our input is 10x10
    Y = tf.placeholder("float", [10, 1], name="Y")   # Our output is 10x1

    # Create a Trainable Variable, "W", our weights for the linear transformation
    initial_W = np.zeros((10, 1))
    W = tf.Variable(initial_W, name="W", dtype="float32")

    # Define Your Loss Function: the squared error (Y - XW)^2
    Loss = tf.pow(tf.add(Y, -tf.matmul(X, W)), 2, name="Loss")

There are a few things to emphasise about this graph:
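That laziness is worth seeing in miniature. Below is a plain-Python sketch of a computation graph that evaluates only the nodes a requested result depends on; the Node class and the two toy losses are illustrative inventions for this article, not TensorFlow internals:

```python
class Node:
    """One node in a toy computation graph: a placeholder (fn=None) or an op."""
    def __init__(self, name, fn=None, deps=()):
        self.name, self.fn, self.deps = name, fn, deps

def evaluate(node, feed, computed=None):
    """Recursively evaluate only the nodes `node` actually depends on."""
    if computed is None:
        computed = {}
    if node.name in computed:
        return computed[node.name]
    if node.fn is None:
        computed[node.name] = feed[node.name]   # placeholder: value fed in
    else:
        args = [evaluate(d, feed, computed) for d in node.deps]
        computed[node.name] = node.fn(*args)
    return computed[node.name]

# Two "losses" sharing one input, like the multitask nets later on.
X = Node("X")
loss1 = Node("loss1", fn=lambda x: x * 2, deps=(X,))
loss2 = Node("loss2", fn=lambda x: x + 100, deps=(X,))

computed = {}
evaluate(loss1, {"X": 3}, computed)
print(sorted(computed))   # only X and loss1 were computed; loss2 was skipped
```

Asking for loss1 never touches loss2, which is exactly the property the multitask optimisers below exploit.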
Tip: Keep Your Graph Separate. You'll typically be doing a fair amount of data manipulation and computation outside of the graph, which can make keeping track of what is and isn't available inside Python a bit confusing. I like to put my graph in a separate file, and often in a separate class, to keep concerns separated, but this isn't required.

A Toy Example - Linear Transformation: Getting Results
    # Import tensorflow and numpy
    import tensorflow as tf
    import numpy as np

    # ======================
    # Define the Graph
    # ======================

    # Create Placeholders For X And Y (for feeding in data)
    X = tf.placeholder("float", [10, 10], name="X")  # Our input is 10x10
    Y = tf.placeholder("float", [10, 1], name="Y")   # Our output is 10x1

    # Create a Trainable Variable, "W", our weights for the linear transformation
    initial_W = np.zeros((10, 1))
    W = tf.Variable(initial_W, name="W", dtype="float32")

    # Define Your Loss Function: the squared error (Y - XW)^2
    Loss = tf.pow(tf.add(Y, -tf.matmul(X, W)), 2, name="Loss")

    with tf.Session() as sess:  # set up the session
        sess.run(tf.initialize_all_variables())
        Model_Loss = sess.run(
            Loss,  # the first argument is the TensorFlow tensor you want returned
            {      # the second argument is the data for the placeholders
                X: np.random.rand(10, 10),
                Y: np.random.rand(10, 1)
            })
        print(Model_Loss)

How To Use Graphs For Multi-Task Learning

So, to start, let's draw a diagram of a simple two-task network that has a shared layer and a specific layer for each individual task. We're going to feed the outputs of these into our loss functions with our targets. I've labelled where we're going to want to create placeholders in the graph.
    # GRAPH CODE
    # ============

    # Import tensorflow
    import tensorflow as tf

    # ======================
    # Define the Graph
    # ======================

    # Define the Placeholders
    X = tf.placeholder("float", [10, 10], name="X")
    Y1 = tf.placeholder("float", [10, 1], name="Y1")
    Y2 = tf.placeholder("float", [10, 1], name="Y2")

    # Define the weights for the layers (random initial values of the right shapes)
    shared_layer_weights = tf.Variable(tf.random_normal([10, 20]), name="share_W")
    Y1_layer_weights = tf.Variable(tf.random_normal([20, 1]), name="share_Y1")
    Y2_layer_weights = tf.Variable(tf.random_normal([20, 1]), name="share_Y2")

    # Construct the Layers with ReLU Activations
    shared_layer = tf.nn.relu(tf.matmul(X, shared_layer_weights))
    Y1_layer = tf.nn.relu(tf.matmul(shared_layer, Y1_layer_weights))
    Y2_layer = tf.nn.relu(tf.matmul(shared_layer, Y2_layer_weights))

    # Calculate Loss (l2_loss takes a single tensor, so subtract first)
    Y1_Loss = tf.nn.l2_loss(Y1 - Y1_layer)
    Y2_Loss = tf.nn.l2_loss(Y2 - Y2_layer)

When we are training this network, we want the parameters of the Task 1 layer not to change no matter how wrong we get Task 2, but the parameters of the shared layer to change with both tasks. This might seem a little difficult: normally you only have one optimiser in a graph, because you only optimise one loss function. Thankfully, using the properties of the graph, it's very easy to train this sort of model in two ways.

Alternate Training

Remember that TensorFlow automatically figures out which calculations are needed for the operation you requested, and only conducts those calculations. This means that if we define an optimiser on only one of the tasks, it will only train the parameters required to compute that task, and will leave the rest alone. Since Task 1 relies only on the Task 1 and shared layers, the Task 2 layer will be untouched. Let's draw another diagram with the desired optimisers at the end of each task.
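Before the full TensorFlow version, the claim that optimising one task's loss leaves the other task's parameters untouched can be checked with a hand-worked scalar sketch in plain Python. The two-layer structure loosely mirrors the network above, but the numbers and gradients are a toy invention, not this article's model:

```python
# Shared scalar weight ws feeds two heads w1 and w2:
#   task1 prediction = w1 * (ws * x),  loss1 = (w1 * ws * x - t1) ** 2
# loss1 does not depend on w2, so dLoss1/dw2 == 0 and a step on loss1
# cannot move w2, no matter how wrong task 2 currently is.

def task1_grads(ws, w1, x, t1):
    h = ws * x                  # shared layer output
    err = w1 * h - t1           # task 1 error
    d_w1 = 2 * err * h          # dLoss1/dw1
    d_ws = 2 * err * w1 * x     # dLoss1/dws (flows through the shared layer)
    return d_ws, d_w1

ws, w1, w2 = 0.5, 1.0, -1.0
lr = 0.01
d_ws, d_w1 = task1_grads(ws, w1, x=2.0, t1=3.0)
ws -= lr * d_ws                 # shared layer moves
w1 -= lr * d_w1                 # task 1 head moves
print(w2)                       # prints -1.0: task 2 head is untouched
```

This is exactly what an optimiser defined on Y1_Loss does at scale: the gradient of Y1_Loss with respect to Y2's weights is identically zero.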
    # GRAPH CODE
    # ============

    # Import tensorflow and numpy
    import tensorflow as tf
    import numpy as np

    # ======================
    # Define the Graph
    # ======================

    # Define the Placeholders
    X = tf.placeholder("float", [10, 10], name="X")
    Y1 = tf.placeholder("float", [10, 20], name="Y1")
    Y2 = tf.placeholder("float", [10, 20], name="Y2")

    # Define the weights for the layers
    initial_shared_layer_weights = np.random.rand(10, 20)
    initial_Y1_layer_weights = np.random.rand(20, 20)
    initial_Y2_layer_weights = np.random.rand(20, 20)

    shared_layer_weights = tf.Variable(initial_shared_layer_weights, name="share_W", dtype="float32")
    Y1_layer_weights = tf.Variable(initial_Y1_layer_weights, name="share_Y1", dtype="float32")
    Y2_layer_weights = tf.Variable(initial_Y2_layer_weights, name="share_Y2", dtype="float32")

    # Construct the Layers with ReLU Activations
    shared_layer = tf.nn.relu(tf.matmul(X, shared_layer_weights))
    Y1_layer = tf.nn.relu(tf.matmul(shared_layer, Y1_layer_weights))
    Y2_layer = tf.nn.relu(tf.matmul(shared_layer, Y2_layer_weights))

    # Calculate Loss
    Y1_Loss = tf.nn.l2_loss(Y1 - Y1_layer)
    Y2_Loss = tf.nn.l2_loss(Y2 - Y2_layer)

    # Optimisers
    Y1_op = tf.train.AdamOptimizer().minimize(Y1_Loss)
    Y2_op = tf.train.AdamOptimizer().minimize(Y2_Loss)

We can conduct multitask learning by alternately calling each task optimiser, which means we can continually transfer some of the information from each task to the other. In a loose sense, we are discovering the 'commonality' between the tasks. The following code implements this for our easy example.
If you are following along, paste this at the bottom of the previous code:

    # Calculation (Session) Code
    # ==========================

    # open the session
    with tf.Session() as session:
        session.run(tf.initialize_all_variables())
        for iters in range(10):
            if np.random.rand() < 0.5:
                _, Y1_loss = session.run([Y1_op, Y1_Loss],
                                {
                                    X: np.random.rand(10, 10) * 10,
                                    Y1: np.random.rand(10, 20) * 10,
                                    Y2: np.random.rand(10, 20) * 10
                                })
                print(Y1_loss)
            else:
                _, Y2_loss = session.run([Y2_op, Y2_Loss],
                                {
                                    X: np.random.rand(10, 10) * 10,
                                    Y1: np.random.rand(10, 20) * 10,
                                    Y2: np.random.rand(10, 20) * 10
                                })
                print(Y2_loss)

Tips: When Is Alternate Training Good?

Alternate training is a good idea when you have a different dataset for each of the different tasks (for example, translating from English to French and English to German). By designing a network in this way, you can improve the performance of each of your individual tasks without having to find more task-specific training data.

Alternate training is the most common situation you'll find yourself in, because there aren't that many datasets that have two or more outputs. We'll come on to one example, but the clearest examples are where you want to build hierarchy into your tasks. For example, in vision, you might want one of your tasks to predict the rotation of an object, and the other what the object would look like if you changed the camera angle. These two tasks are obviously related: in fact, the rotation probably comes before the image generation.

Tips: When Is Alternate Training Less Good?

Alternate training can easily become biased towards a specific task. The first way is obvious: if one of your tasks has a far larger dataset than the other, then if you train in proportion to the dataset sizes your shared layer will contain more information about the more significant task. The second is less so: if you train alternately, the final task in your model will create a bias in the parameters.
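The first bias can be made concrete with a plain-Python sketch of the schedule alone (no network, and the dataset sizes are hypothetical): if steps are drawn in proportion to dataset size, the larger task simply gets far more updates to the shared layer.

```python
import random

# Hypothetical dataset sizes: task 1 has 9x the data of task 2.
n_task1, n_task2 = 9000, 1000
p_task1 = n_task1 / (n_task1 + n_task2)

steps = {"task1": 0, "task2": 0}
random.seed(0)
for _ in range(10000):
    if random.random() < p_task1:
        steps["task1"] += 1   # would call session.run([Y1_op, ...]) here
    else:
        steps["task2"] += 1   # would call session.run([Y2_op, ...]) here

print(steps)   # roughly a 9:1 split: the shared layer is shaped mostly by task 1
```

Every one of those steps moves the shared weights, so a 9:1 schedule means the shared representation is dominated by the larger task.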
There isn't any obvious way to overcome this problem, but it does mean that in circumstances where you don't have to train alternately, you shouldn't.

Training At The Same Time: Joint Training

    # GRAPH CODE
    # ============

    # Import tensorflow and numpy
    import tensorflow as tf
    import numpy as np

    # ======================
    # Define the Graph
    # ======================

    # Define the Placeholders
    X = tf.placeholder("float", [10, 10], name="X")
    Y1 = tf.placeholder("float", [10, 20], name="Y1")
    Y2 = tf.placeholder("float", [10, 20], name="Y2")

    # Define the weights for the layers
    initial_shared_layer_weights = np.random.rand(10, 20)
    initial_Y1_layer_weights = np.random.rand(20, 20)
    initial_Y2_layer_weights = np.random.rand(20, 20)

    shared_layer_weights = tf.Variable(initial_shared_layer_weights, name="share_W", dtype="float32")
    Y1_layer_weights = tf.Variable(initial_Y1_layer_weights, name="share_Y1", dtype="float32")
    Y2_layer_weights = tf.Variable(initial_Y2_layer_weights, name="share_Y2", dtype="float32")

    # Construct the Layers with ReLU Activations
    shared_layer = tf.nn.relu(tf.matmul(X, shared_layer_weights))
    Y1_layer = tf.nn.relu(tf.matmul(shared_layer, Y1_layer_weights))
    Y2_layer = tf.nn.relu(tf.matmul(shared_layer, Y2_layer_weights))

    # Calculate Loss: one joint objective that sums the task losses
    Y1_Loss = tf.nn.l2_loss(Y1 - Y1_layer)
    Y2_Loss = tf.nn.l2_loss(Y2 - Y2_layer)
    Joint_Loss = Y1_Loss + Y2_Loss

    # Optimiser: a single optimiser on the joint loss
    Optimiser = tf.train.AdamOptimizer().minimize(Joint_Loss)

    # Joint Training
    # Calculation (Session) Code
    # ==========================

    # open the session
    with tf.Session() as session:
        session.run(tf.initialize_all_variables())
        _, joint_loss = session.run([Optimiser, Joint_Loss],
                        {
                            X: np.random.rand(10, 10) * 10,
                            Y1: np.random.rand(10, 20) * 10,
                            Y2: np.random.rand(10, 20) * 10
                        })
        print(joint_loss)

Conclusions And Next Steps

For those of you who want a more meaty, more detailed example of how this can be used to improve
performance in multiple tasks, then stay tuned for Part 2 of the tutorial, where we'll delve into natural language processing to build a multitask model for shallow parsing and part-of-speech tagging.

Bio: Jonathan Godwin is currently studying for an MSc in Machine Learning at UCL, with a specialism in deep multitask learning for NLP. He will be finishing in September and will be looking for jobs/research roles where he can use this skill set on interesting problems.

Original. Reposted with permission.