The purpose of this section is to introduce students to the new Keras API. At the time of writing, Keras is being developed only for use with the TensorFlow backend. For convenience (and to keep the guide up to date for the foreseeable future), only the new Keras API will be covered, and only in the context of TensorFlow.
This section will cover some introductory information regarding the TensorFlow system and how it functions (i.e. execution flow / optimizations / purpose). It will also serve as motivation for the use of a higher-level API such as Keras and explain some of the pitfalls of immediately resorting to such libraries. Finally, if so desired, students will learn some of the features that TensorFlow has to offer for visualization, such as TensorBoard.
“TensorFlow™ is an open source software library for high performance numerical computation. […] it comes with strong support for machine learning and deep learning and the flexible numerical computation core is used across many other scientific domains.”
TensorFlow is a system for designing computation graphs for execution on the various compute devices of a typical computer. A large part of TensorFlow consists of properly allocating space on each device and assigning each function to the device on which it is performed best in order to save computation time. While TensorFlow used to be quite verbose, its general focus has shifted from the detailed generation of computation graphs to something that anybody with a little programming experience can set up.
In particular, it has been used for almost any type of machine learning algorithm, as its automatic differentiation scheme gives it a convenience factor over computational graph implementations that do not share such a feature. While this project will not be using TensorFlow in depth (since the purpose is not research in the area of systems and/or machine learning algorithms), you will use some of the features that make TensorFlow more convenient for beginners than similar libraries like PyTorch.
As it turns out, all operations in TensorFlow can be seen as forming a dataflow graph. A controller partitions and divides the graph with respect to available compute devices (CPU/GPU/TPU) and controls all execution requests. Here we won't be focusing on the way TensorFlow functions as a system, but it's useful to know that the graph generated by TensorFlow is dynamically optimized depending on the requests made at any point in time. This dynamic optimization of graphs allowed these types of machine learning libraries to be successful due to their performance. Since the controller keeps track of and divides all operations, it was also possible to create a backwards flow of the model depending on the types of functions used at each point in the control graph and thereby implement an opposite direction gradient graph for use in the backpropagation of a model.
Clearly, with any graph there are some obvious optimizations that can be made by pruning and caching specific nodes in the context of a given request. TensorFlow is designed with these types of optimizations in mind and dynamically rewrites graphs into more optimized ones designed for the hardware in question. This ensures that there is minimal wasted compute and static memory, and it also makes sure that the hardware is used to its fullest extent. (For example, half-precision float usage in Pascal NVIDIA GPUs.) This isn't a project on data-intensive computing and systems design, however, so further details will not be covered. Just know that the TensorFlow framework is designed to be high-level and to optimize high-level requests, much like Apache Spark, which is built similarly but has a different goal.
TensorFlow at its core is already extremely useful for the normal user given the way it emphasizes a more programmer-oriented manner of designing machine learning models. This gave it a slight edge over frameworks such as Caffe, PyTorch, and CNTK that focused more so on explicit and optimized execution, but it also meant that performance would clearly take a small hit (after all, dynamic optimization of a control flow graph can only do so much). After some time, even this programmer-oriented style seemed to generate a lot of problems for those who were not already used to the TensorFlow API and general network creation process. As it turns out, this sheer verbosity and specific style made it prohibitively difficult to understand even as a higher-level library when compared to others! For those who are curious, consider the following snippet of a typical TensorFlow <1.0 model's training procedure:
This is where Keras comes in to help the developer. Keras calls itself "an API designed for human beings, not machines". The reason for such a claim comes from the straightforward API and programming paradigms. There are actually quite a few ways of working with Keras, and all of them are perfectly grounded in a more programmer-friendly manner. That is, execution loops look exactly like traditional loops, and the written code is understandable with obvious points of execution. With the release of Keras 2, Keras has become even more integrated into TensorFlow, so the low-level operations remain available whenever they are needed. In fact, the creators of TensorFlow highly recommend working with the Keras abstraction unless you need to deal with lower-level operations!
TensorBoard is a tool that serves to visualize training data and network aspects during the training process and after training. The purpose of it was to be able to analyze the model data remotely as TensorBoard is generally launched on an open port. Here, we will be using TensorBoard locally to analyze a very simple network so as to get some insight as to how the tool works. I suggest reading the sections below on how to use Keras before covering this part as it does depend on being able to generate the network without the TensorBoard callbacks first.
To start, let us consider the ResNet network that is built at the end of this guide. Once the model has been compiled, we can declare the necessary callbacks and pass them to the training call through an additional keyword argument (callbacks). This can be seen in the image below.
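A minimal sketch of attaching the TensorBoard callback is given here. The tiny model and the random data are hypothetical stand-ins (the guide uses its ResNet instead), and the batch size and epoch count are arbitrary choices for illustration.

```python
import os

import numpy as np
from tensorflow import keras

# Hypothetical stand-in model; the guide uses its ResNet here instead.
inputs = keras.Input(shape=(8,))
hidden = keras.layers.Dense(16, activation="relu")(inputs)
outputs = keras.layers.Dense(3, activation="softmax")(hidden)
model = keras.Model(inputs, outputs)

# The loss and metrics named here are what TensorBoard records by default.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# The callback writes its event files below this directory.
log_dir = "./tensorboard"
tensorboard_cb = keras.callbacks.TensorBoard(log_dir=log_dir)

# Random data, only so that fit() has something to run on.
x = np.random.rand(32, 8).astype("float32")
y = np.random.randint(0, 3, size=(32,))

# Callbacks are handed to fit() through the `callbacks` keyword argument.
history = model.fit(x, y, epochs=1, batch_size=8,
                    callbacks=[tensorboard_cb], verbose=0)
```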
This sets up a directory ./tensorboard for the callback to use to store the desired values. By default, TensorBoard stores the losses and metrics specified in the call to compile. If you want to record other things (such as the values of particular variables), you will have to either use the tf.summary methods to save scalar values manually or change the default arguments of the TensorBoard callback, which can be found here.
If you are interested in saving a particular value (or being able to visualize projected data or something along those lines), I suggest consulting the API linked above or asking the TA for help on how to do so. Note that in order to actually visualize this information, you will have to open a new terminal for TensorBoard to run in with the following command:
tensorboard --logdir '/path/to/log/dir'
Afterward, you can access the TensorBoard page at its default address. (If this link doesn't work, simply enter localhost:$PORT into your URL bar, using the port that TensorBoard reports on startup; the default is 6006.)
This section will cover the Keras API and how networks are built using the two most common design paradigms: functional and sequential. Both of these paradigms will be analyzed and covered, but since this project will be using the functional style, most examples will be given in this format. The parts of the Keras API that will be covered most in-depth will be those required in order to build a CNN, but an introduction to the methods used for vanilla NNs will also be covered for students who seek to work on a model for their final project. If you wish to work on a project that requires the usage of an RNN, I suggest consulting the TensorFlow example page as it explains a lot of the theory behind RNNs and how the models are built in a manner better than can be done here without prior exposure.
There are two different ways of designing neural networks (one of which can be extended to the sub-classing method). Here, we will cover both of these programming paradigms and allude only slightly to the sub-classing method of model creation in Keras. The first type of model creation paradigm to cover is the sequential style. As is suggested by the word "sequential", the sequential style creates a model using a series of layers which are sequential in order and processed from the beginning to the end. In other words, this sequential style forms a queue of functors that are then evaluated with inputs and outputs automatically linked in the order they are queued. The types of graphs that can be created by this type of paradigm can be seen in the figure below, along with the code necessary to generate the graph shown.
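As a minimal sketch of the sequential style, the snippet here queues three dense layers. The layer sizes and the 784-dimensional input are assumptions chosen for illustration and are not taken from the figure.

```python
from tensorflow import keras

# Layers are queued in order; each output feeds the next layer's input.
model = keras.Sequential([
    keras.Input(shape=(784,)),                      # assumed input size
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),   # assumed 10 classes
])

# Prints one row per queued layer along with its output shape.
model.summary()
```

Because the inputs and outputs are linked automatically, nothing in this code names the tensor that flows between the layers.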
As you can see, the code is very straightforward, but because of the inherent data structure it mimics (a queue), it's not useful for generating models that include functions with multiple inputs. On the other hand, while the functional paradigm is inherently more programmer-friendly, it is clearly more verbose and more difficult to understand at first sight. Recall the discussion of functors (function objects) in the Python introduction. The functional paradigm uses functor calls (as opposed to adding them to a queue) as the main way of designing networks. Because of the ability to control where the output goes with respect to the input, it allows you to implement any ML model that has a control flow that can be modeled as a DAG (directed acyclic graph). Two perfect examples where this becomes useful are networks such as ResNet or Inception, where the input to a functor is a combination of multiple outputs (a sum in ResNet, a concatenation in Inception). See below for how the same model as above would be created in the functional paradigm.
While the functional paradigm is already immensely more useful due to its ability to create more general networks, simply laying out the network in the global namespace makes it prohibitively difficult to incorporate it into new code or modify it in blocks. Naturally, we would want our models to be a class that can be instantiated whenever we need the model. This makes the network reusable and often allows for easier debugging and testing. For example, the above model can be embedded into a class as follows (note that it sub-classes the tf.keras.Model class):
However, despite the fact that the sub-classing method feels more natural for a typical programmer, the control flow is not defined in the same sense as in a sequential or non-subclassing functional model. In fact, if you try to save the model, analyze it, or do something that requires prior knowledge of the model (which can be done with a static model), then everything comes falling apart. The problem comes from the fact that the call method is what determines the forward pass, but it's not obvious what the call will do unless it is executed (an implicit call). As a result, it becomes a pain to analyze layers without dealing with class functions, and it becomes more difficult to deal with the inner parts of the model itself. As an example, consider the summary printout for the subclass model in Figure 6 and compare it to the summary for the functional method as seen in Figure 5. Keep this in mind should you decide to follow the sub-classing method.
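A minimal sketch of the functional style follows, assuming a small three-layer MLP as a stand-in for the model in the figure (the sizes are illustrative). The second model is an extra illustration of a layer with two inputs, which a queue of layers cannot express.

```python
from tensorflow import keras

# Each layer is called like a function on the output of the previous one.
inputs = keras.Input(shape=(784,))
h1 = keras.layers.Dense(64, activation="relu")(inputs)
h2 = keras.layers.Dense(64, activation="relu")(h1)
outputs = keras.layers.Dense(10, activation="softmax")(h2)
model = keras.Model(inputs=inputs, outputs=outputs)

# A layer with two inputs: h1 skips over h2 and the two are summed
# before classification, which makes the graph a DAG rather than a chain.
merged = keras.layers.Add()([h1, h2])
skip_outputs = keras.layers.Dense(10, activation="softmax")(merged)
skip_model = keras.Model(inputs=inputs, outputs=skip_outputs)

model.summary()
```

The explicit variable names are the price of the extra verbosity, but they are also what makes it possible to route h1 to more than one place.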
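A sketch of the sub-classing approach is given here, again assuming a small MLP with illustrative layer sizes. The class name SmallMLP and its num_classes argument are hypothetical and not taken from the guide.

```python
import numpy as np
from tensorflow import keras


class SmallMLP(keras.Model):
    """Hypothetical three-layer MLP wrapped in a reusable class."""

    def __init__(self, num_classes=10):
        super().__init__()
        # Layers are created once, in the constructor.
        self.hidden1 = keras.layers.Dense(64, activation="relu")
        self.hidden2 = keras.layers.Dense(64, activation="relu")
        self.classifier = keras.layers.Dense(num_classes, activation="softmax")

    def call(self, inputs):
        # The forward pass is only known once this method is executed.
        x = self.hidden1(inputs)
        x = self.hidden2(x)
        return self.classifier(x)


model = SmallMLP()
# Calling the model on a batch builds the weights and runs the forward pass.
probs = model(np.zeros((2, 784), dtype="float32"))
```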
This section will briefly cover the most common layers used in particular networks and their arguments along with how they are used. Of course, you are not required to use the following when creating your network. Everything can be conveniently found in this page on the core layers, this page on the convolutional layers, and this page on the pooling layers. It is expected that students recognize what each layer does in theory by this point.
One thing to keep in mind is that there are actually many ways to define inputs and outputs when using TensorFlow. For now, just assume that the only way of declaring an input will be through Keras' input function keras.Input(), and models will be declared using Keras' model wrapping function keras.Model(inputs, outputs). This should at least reduce the amount of confusion that would naturally come from mixing TensorFlow terms such as placeholders with their Keras wrappers.
A fully connected layer is known in Keras as a dense layer. The name comes from the fact that every neuron in the new layer is connected to every neuron in the previous layer. Creating an MLP therefore amounts to chaining dense layers with the appropriate arguments. The only other layer which may come into use when designing such a network is the dropout layer, which randomly drops neurons with a specific probability at each training step. This encourages the re-learning of major representations, since overfit features lead to incorrect predictions once the neurons they rely on are dropped and are changed as a result.
Dense Layer
Initialization
keras.layers.Dense(units, activation=None, use_bias=True)
Arguments
units: Used to determine the number of neurons for this particular layer
activation: Used to immediately follow up the layer with an activation function from this page. The most common are "relu", "softmax", "tanh", and "sigmoid".
Dropout Layer
Initialization
keras.layers.Dropout(rate)
Arguments
rate: Determines the fraction of neurons from the preceding layer that are dropped. Can be any number between 0 and 1.
Activation Layers
Note
Either follow the functional (Layer Init -> Layer Use) procedure or immediately call the activation functions with the input as an argument from this page. Note that the alternative method returns the actual value placeholder and not a function to be called as in the functional method.
Initialization
keras.layers.Activation(activation)
Arguments
activation: The name of the activation function desired for use.
Alternative Method:
Function Call
(activation)(x)
e.g. relu(x) or tanh(x)
Arguments
x: The input that the activation function should be applied to
Returns
A placeholder for the actual value of the function after the activation
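The two ways of applying an activation can be sketched as follows. The layer sizes, the dropout rate of 0.5 and the sample values are assumptions made for illustration.

```python
import numpy as np
from tensorflow import keras

# Functional procedure: initialize the layer, then use it on an input.
inputs = keras.Input(shape=(4,))
dense = keras.layers.Dense(8)                    # layer init
x = dense(inputs)                                # layer use
x = keras.layers.Activation("relu")(x)           # activation as a layer
x = keras.layers.Dropout(0.5)(x)                 # only active during training
outputs = keras.layers.Dense(3, activation="softmax")(x)
model = keras.Model(inputs, outputs)

# Alternative method: call the activation function directly on a value.
values = np.array([[-1.0, 0.0, 2.0]], dtype="float32")
activated = keras.activations.relu(values)       # negative entries become 0
```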
Recall that the creation of a CNN consists of building a feature extractor using convolutional layers and then a classifier or regressor using a simple MLP (or any other ML algorithm that can perform a classification/regression). The directions for how to create the MLP section have already been covered above. Instead, this section will only introduce the layers required for building the convolutional feature extractor. Most commonly the feature extractor is topped off with a flatten layer in order to arrange the neurons into a one-dimensional shape, but there have been some interesting networks that use the resultant two-dimensional extracted features for other purposes.
Convolutional Layer
Initialization
keras.layers.Conv2D(filters, kernel_size, strides=(1, 1), padding='valid', activation=None, use_bias=True)
Arguments
filters: The number of filters to learn for convolutional layer. Determines the number of output channels (one for each filter).
kernel_size: An integer (or a tuple of two integers) corresponding to the height and width of the filter to be trained. This should ideally be an odd number larger than one in most cases.
strides: A tuple corresponding to the "jump" between each convolution as the kernel slides across the image. One integer for each spatial dimension of the image.
padding: Determines the type of convolution to be performed dependent on the padding. For example, the "valid" specifier means no padding is used, while the "same" specifier means enough padding is added to ensure the output height and width remain the same as those of the input when the stride is one.
activation: As per usual, allows for the immediate application of an activation layer following the convolutional layer.
use_bias: Determines whether or not a bias is added after the convolution takes place. This is true by default and should usually be left that way.
Max/Average Pooling Layer
Initialization
keras.layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding='valid')
keras.layers.AveragePooling2D(pool_size=(2, 2), strides=None, padding='valid')
Arguments
pool_size: The size of the window that slides over the image and on which the pooling operation acts.
strides: A tuple corresponding to the number of pixels between each window jump. When left as None, this defaults to pool_size so that the windows do not overlap.
padding: As with the convolutional layer, determines whether padding is added to preserve the size of the image on the output for a stride of one ("same") or the operation is performed on the input without any padding ("valid").
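The effect of the strides and padding arguments on the output size can be worked out by hand. The helper output_size below is hypothetical (it is not part of Keras) and simply applies the usual formulas for one spatial axis.

```python
import math


def output_size(input_size, window, stride=1, padding="valid"):
    """Output length along one spatial axis for a convolution or pooling window."""
    if padding == "valid":
        # No padding: only windows that fit entirely inside the input count.
        return (input_size - window) // stride + 1
    if padding == "same":
        # Enough padding is added that only the stride shrinks the output.
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")


print(output_size(32, 3))                            # 30: 3x3 conv, no padding
print(output_size(32, 3, padding="same"))            # 32: size preserved
print(output_size(32, 3, stride=2, padding="same"))  # 16: halved by the stride
print(output_size(32, 2, stride=2))                  # 16: 2x2 pooling
```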
Flatten Layer
Note
Depending on the Keras version, a dense layer given an input of rank greater than two either flattens it or operates only on the last axis (current versions do the latter), so I highly suggest being explicit with your calls so as to leave no doubt as to the output of your layers.
Initialization
keras.layers.Flatten()
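Put together, the layers above can form a small feature extractor followed by a classifier. This is only a sketch: the 32x32x3 input, the filter counts and the ten output classes are assumptions chosen for illustration.

```python
from tensorflow import keras

inputs = keras.Input(shape=(32, 32, 3))
# "same" padding with stride 1 keeps the 32x32 spatial size.
x = keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
# 2x2 pooling with the default stride halves it to 16x16.
x = keras.layers.MaxPooling2D(pool_size=(2, 2))(x)
# "valid" padding with a 3x3 kernel trims it to 14x14.
x = keras.layers.Conv2D(32, 3, padding="valid", activation="relu")(x)
# Another 2x2 pooling gives 7x7 with 32 channels.
x = keras.layers.AveragePooling2D(pool_size=(2, 2))(x)
# Explicit flatten: 7 * 7 * 32 = 1568 features per image.
features = keras.layers.Flatten()(x)
outputs = keras.layers.Dense(10, activation="softmax")(features)

extractor = keras.Model(inputs, features)   # feature extractor only
model = keras.Model(inputs, outputs)        # extractor plus classifier
```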
On the topic of layer functions, there is a layer called SpatialDropout2D that drops entire feature maps in order to recreate the effects of dropout in convolutional networks. I have actually never seen the effects of such a dropout on a convolutional network, but you are free to test how it works if you so desire.
Loss functions and optimizers are central to the usage of any neural network. While information on the functions that follow can be found on this page (losses) and this page (optimizers), here are some of the more common ones, along with a brief description of their usage in the case of the loss functions.
One thing to note is that the most common way that losses are defined is simply by name when the model is compiled. Although it is possible to declare your own functions and pass them to the compile method, I suggest relying on the predefined functions unless you know what you are doing. Since loss functions are very dependent on the type of data used, this will focus less on the function declaration and more on the usage.
Binary Cross Entropy
Usage: Used for two-class classification problems. A special case of the general cross entropy loss.
Compile Keyword: "binary_crossentropy"
Categorical Cross Entropy
Usage: Used for K-class classification problems. Labels are given as one-hot (or probability) vectors with one entry per class.
Compile Keyword: "categorical_crossentropy"
Sparse Categorical Cross Entropy
Usage: Used for K-class classification problems. The only difference from the above is that labels are given as integer class indices, so images must belong to only a single class.
Compile Keyword: "sparse_categorical_crossentropy"
Mean Squared Error
Usage: The typical L2 loss. Used for regression tasks.
Compile Keyword: "mse"
Mean Absolute Error
Usage: The typical L1 loss. Used for regression tasks with many outliers that aren't significant.
Compile Keyword: "mae"
Mean Squared Logarithmic Error
Usage: A modified version of the L2 loss where the logarithm of the values (plus one) is taken before their difference. This reduces the impact of outliers during training.
Compile Keyword: "msle"
Kullback-Leibler Divergence
Usage: Measures the "distance" between two probability densities. Most often this means the distance between the true density and the estimated density.
Initialization
keras.losses.KLDivergence()
Hinge Loss
Usage: Used for binary classifications to emphasize the separation of classes over the general probability of them occurring.
Compile Keyword: "hinge"
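To make a few of these losses concrete, the sketch here evaluates them by hand for a single sample in plain Python. The helper functions are hypothetical re-implementations for illustration, not the Keras functions themselves, and the sample values are made up.

```python
import math


def categorical_crossentropy(one_hot, probs):
    # Label is a one-hot vector with one entry per class.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def sparse_categorical_crossentropy(label, probs):
    # Label is a single integer class index.
    return -math.log(probs[label])


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


probs = [0.7, 0.2, 0.1]                       # predicted class probabilities
cce = categorical_crossentropy([1, 0, 0], probs)
scce = sparse_categorical_crossentropy(0, probs)
print(round(cce, 4), round(scce, 4))          # both 0.3567: same loss, different label format

# One large error dominates the L2 loss far more than the L1 loss.
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 9.0]))  # 12.0
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 9.0]))  # 2.0
```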
Adam (The most commonly used optimizer)
Initialization
keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False)
Arguments
learning_rate: The step size used for each weight update. This is usually in the range [0.001, 0.1] and varies depending on the network and data used.
beta_1 / beta_2: The exponential decay rates for the running averages of the gradient (beta_1) and of the squared gradient (beta_2). These should usually be left alone unless you know what you're doing.
epsilon: A small constant added to the denominator of the update to ensure that it does not produce a division by zero.
amsgrad: Enables the AMSGrad variant, which scales the update by the maximum of the past squared-gradient averages instead of the current one. This lessens the impact of ill-chosen hyperparameters, but I suggest you leave it off as described here.
SGD (The most basic stochastic gradient descent optimizer)
Initialization
keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False)
Arguments
learning_rate: As seen above, the step size used for each weight update. The value used here is about the same.
momentum: If momentum-supported updates are required, this determines the amount that the momentum contributes to the overall gradient update.
nesterov: A modified version of the momentum update where the gradient is evaluated at the point predicted by the momentum step instead of at the current point.
There is actually a nice explanation of all types of optimizers on this page. All that is needed to understand how all of the optimizers work is some simple knowledge of how gradients function on a general surface. It's quite interesting to read as it does tell you which optimizer is most likely to work best, but as of recently Adam and its variants are definitely the most used due to their performance on most training tasks.
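To show how the arguments above enter the updates, here is a plain-Python sketch of a single SGD step and a single Adam step for one scalar weight. It follows the commonly documented update rules; the helper names and the toy objective f(w) = (w - 3)^2 are assumptions made for illustration, and AMSGrad is left out.

```python
import math


def sgd_step(w, grad, velocity, learning_rate=0.01, momentum=0.0, nesterov=False):
    """One SGD update with optional (Nesterov) momentum."""
    velocity = momentum * velocity - learning_rate * grad
    if nesterov:
        w = w + momentum * velocity - learning_rate * grad
    else:
        w = w + velocity
    return w, velocity


def adam_step(w, grad, m, v, t, learning_rate=0.001,
              beta_1=0.9, beta_2=0.999, epsilon=1e-07):
    """One Adam update; t is the 1-based step count."""
    m = beta_1 * m + (1 - beta_1) * grad          # running average of gradients
    v = beta_2 * v + (1 - beta_2) * grad ** 2     # running average of squared gradients
    m_hat = m / (1 - beta_1 ** t)                 # bias correction
    v_hat = v / (1 - beta_2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v


def gradient(w):
    # Gradient of the toy objective f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)


w_sgd, velocity = 0.0, 0.0
for _ in range(200):
    w_sgd, velocity = sgd_step(w_sgd, gradient(w_sgd), velocity,
                               learning_rate=0.1, momentum=0.9)

w_adam, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    w_adam, m, v = adam_step(w_adam, gradient(w_adam), m, v, t, learning_rate=0.01)

print(w_sgd, w_adam)  # both end up close to the minimum at w = 3
```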
Exporting your model (or at least the weights) is one of the most important parts of any project. The reason for this is that the moment TensorFlow is closed, the memory taken up by the weights and the model graph representation is freed by the underlying process. In general, this also means that saving the weights every couple of epochs is good practice, but collecting a significant number of these sets of weights can take up too much disk space. Finding a balance between saving previous model weights and deleting older ones is something many will take into account when training networks that may take hours (or even days) per epoch. There are functions for saving only the architecture or only the weights, but they will not be covered as they will likely cause confusion later in the project.
load_model
Usage
keras.models.load_model(filename, custom_objects)
Arguments
filename: The name of the input model file. It's useful to have an explicit extension (.h5). Older versions do not need it for proper saving/loading of models, but newer versions of Keras use the extension to pick the file format.
custom_objects: The realized implementation of custom classes used within the model
Note: If your model uses non-internal classes in its implementation, then you will have to define the class separately for it to function properly once loaded.
save_model
Usage
keras.models.save_model(model, filename)
Arguments
model: The model instance to be saved.
filename: The name of the output file. Again, it's useful to have the extension, but older versions do not need it for proper model saving.
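A minimal round trip through save_model and load_model might look as follows. The tiny model and the temporary directory are assumptions made for illustration; a model that uses custom classes would additionally need the custom_objects argument when loading.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras

# Hypothetical stand-in model with nothing custom in it.
inputs = keras.Input(shape=(4,))
outputs = keras.layers.Dense(2, activation="softmax")(inputs)
model = keras.Model(inputs, outputs)

# Save the architecture and the weights together in a single .h5 file.
path = os.path.join(tempfile.mkdtemp(), "model.h5")
keras.models.save_model(model, path)

# Load it back; the restored model is independent of the original object.
restored = keras.models.load_model(path)

# Both models should give the same predictions for the same input.
x = np.ones((1, 4), dtype="float32")
same = np.allclose(model.predict(x, verbose=0), restored.predict(x, verbose=0))
```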
The fundamental feature of ResNet that distinguishes it from other networks is the pass-through connection made between each block of convolution and pooling operations. This pass-through connection is commonly referred to as a skip connection (or residual connection). To see why it is referred to as a skip connection, one might imagine condensing all of the non-skip operations of a block into a single feature block F(x) and then modeling the flow of inputs from there. This type of modeling can be seen in Fig. 7. Notice that this network essentially functions as a sort of "bus". For every residual block, the network decides whether or not the block should modify the input/gradient passing through. If the n-th block is allowed to modify the input, then Fn(x) is set to be some non-trivial function and its gradient must thereby be non-zero. However, if a block is unneeded, then it is simply set to the zero function (Fn(x) = 0). This means that the network, upon training, learns to figure out which combination of these blocks must be used in order to learn some optimal function F*(x).
From this explanation, it is easier to understand why the use of residual blocks is often referred to as residual learning. Suppose we have some function we wish to model, T(x). If we consider an infinite sequence of residual blocks, then we can imagine our optimal function to take on the form F*(x) = x + F1(x) + F2(x) + F3(x) + ..., which in turn leaves us with the residual R(x) = T(x) - x - F1(x) - F2(x) - ... = T*(x) - ΣFn(x), where T*(x) = T(x) - x. It is exactly this property that makes it seem like the network learns the residuals. Since x is passed through the model without any interference, what the network learns is not the function T(x) itself, but rather the difference T*(x) = T(x) - x, or the residual of T(x) relative to the identity function x!
The interpretation above is the most common interpretation of how ResNet functions. However, that isn't to say that it's the only one. While the above gives us the context needed to make sense of the name of the network, it's a bit harder to see why the makers believed it would work so well. In fact, nothing about the original explanation gives any idea of why the network performed well even with 1000 layers. In order to analyze this behavior, another view of ResNet was needed that didn't make assumptions about the input features. In another interpretation, ResNet can be viewed as an ensemble of shallower networks, much like in Fig. 8. From this view, we can see that an input can travel along a number of paths that grows exponentially with the number of blocks. It also essentially means that, much like in any other ensemble, it is possible to remove any layer and still have some semblance of a proper decision function left. [This is a feature lacking from any single path feed-forward network.] In fact, the testing that the paper by Veit et al. performed showed not only this, but several other characteristics that were more akin to properties of an ensemble than of a typical neural network.
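The exponential growth in the number of paths can be checked with a few lines of plain Python. This sketch assumes that an input either skips or traverses each residual block; the function enumerate_paths is a hypothetical helper for illustration.

```python
from itertools import product


def enumerate_paths(num_blocks):
    """List every path through the network as a tuple of per-block choices."""
    # False = take the skip connection, True = go through the block.
    return list(product([False, True], repeat=num_blocks))


paths = enumerate_paths(3)
print(len(paths))   # 8, i.e. 2 ** 3 paths for 3 blocks

# Count how many paths pass through exactly k blocks.
lengths = {}
for path in paths:
    depth = sum(path)
    lengths[depth] = lengths.get(depth, 0) + 1
print(lengths)      # {0: 1, 1: 3, 2: 3, 3: 1}
```

Most paths go through only some of the blocks, which is what makes the comparison to an ensemble of shallower networks plausible.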
In the following example, we will be recreating a smaller version of the ResNet network. Since the ResNet networks are made from the same residual blocks whose outputs are added to their inputs, the design can easily be adapted to make smaller versions of the same network. In this case, we'll only use a few blocks to illustrate the point. To start off, we have two different types of blocks and some prepended layers in the original ResNet network. The entire network along with the two subblocks that comprise it can be seen in the images below.
Given the network and the downsizing convolutional layers, we can think of this network as consisting of two kinds of residual blocks: one which incorporates a downsizing convolution (indicated by the dashed lines), and one which doesn't (indicated by the solid lines). Considering that both blocks contain two layers with varying output channel sizes, we can either choose to make a single function with a flag to determine the convolutional downsizing on the residual connection, or we can make two functions to take care of both explicitly. For this particular implementation, we will choose to make two different functions. To make the distinction between both blocks, we will call the functions resBlockSame and resBlockDownsized to make it clear whether the output is the same dimension as the input or downsized.
We design the functions as seen below:
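One possible way to write the two functions is sketched here. Only the names resBlockSame and resBlockDownsized come from the text; the signatures, the 3x3 kernels, the 1x1 strided convolution on the shortcut and the omission of batch normalization (which the original ResNet uses) are assumptions made to keep the sketch short.

```python
from tensorflow import keras


def resBlockSame(x, filters):
    """Residual block whose output has the same shape as its input."""
    shortcut = x
    y = keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = keras.layers.Conv2D(filters, 3, padding="same")(y)
    # The skip connection: add the untouched input to the block output.
    y = keras.layers.Add()([shortcut, y])
    return keras.layers.Activation("relu")(y)


def resBlockDownsized(x, filters):
    """Residual block that halves the spatial size and changes the channels."""
    # The shortcut needs its own strided convolution so the shapes match.
    shortcut = keras.layers.Conv2D(filters, 1, strides=2, padding="same")(x)
    y = keras.layers.Conv2D(filters, 3, strides=2, padding="same",
                            activation="relu")(x)
    y = keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = keras.layers.Add()([shortcut, y])
    return keras.layers.Activation("relu")(y)


# Quick shape check on a hypothetical 32x32 RGB input.
inputs = keras.Input(shape=(32, 32, 3))
stem = keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
same = resBlockSame(stem, 16)
down = resBlockDownsized(same, 32)
same_model = keras.Model(inputs, same)
down_model = keras.Model(inputs, down)
```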
Finally, we put the entire network together, keeping in mind the overall network architecture. For convenience, the input size and the output dimensions are defined as constants above the defined functions. Overall, the network code looks as follows:
With this, we have fully implemented a ResNet32 network. Ideally, this should be contained within a function so that the model can be easily used with varying input and output sizes, but that's irrelevant for this model building example. We will be using this network with the CIFAR 10 dataset, which is easily loaded with a TensorFlow built-in function. In order to fit the model on the images, we also need an optimizer. The Adam optimizer is the most commonly used optimizer, but other ones can be seen in the optimizers section. Since we are dealing with a classification problem, the natural loss to use is a categorical cross-entropy loss. The sparse in the function used here simply means that each label is given as a single integer class index rather than a one-hot vector, so each image can only belong to a single class. That allows some optimizations to be made when evaluating the loss across all the images. The network is then trained for 10 epochs and the test accuracy is evaluated at each epoch with the following code:
And that's it! This is the end of the concrete example. Combining all this code should give you a Python file that can be executed in order to train a network successfully. Keep in mind that this example was convenient because the training and testing images and labels were loaded by an internal function. However, in your project, you will need to load the images into a NumPy array on your own.
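The training procedure described above can be sketched as follows. The small stand-in model, the helper names, the learning rate of 0.001 and the batch size of 64 are assumptions made for illustration; in the guide, the ResNet built earlier takes the place of build_stand_in_model.

```python
from tensorflow import keras


def build_stand_in_model(input_shape=(32, 32, 3), num_classes=10):
    """Hypothetical small CNN used here in place of the guide's ResNet."""
    inputs = keras.Input(shape=input_shape)
    x = keras.layers.Conv2D(8, 3, padding="same", activation="relu")(inputs)
    x = keras.layers.GlobalAveragePooling2D()(x)
    outputs = keras.layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)


def load_cifar10():
    """Download CIFAR-10 and scale the pixel values to [0, 1]."""
    (x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
    x_train = x_train.astype("float32") / 255.0
    x_test = x_test.astype("float32") / 255.0
    return (x_train, y_train), (x_test, y_test)


def train(model, train_data, test_data, epochs=10, batch_size=64):
    """Compile and fit the model, evaluating on the test data every epoch."""
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  # "sparse": the labels are integer class indices.
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model.fit(train_data[0], train_data[1],
                     epochs=epochs, batch_size=batch_size,
                     validation_data=test_data, verbose=2)
```

Calling train(build_stand_in_model(), *load_cifar10()) would then run the ten epochs on CIFAR 10, and the returned history object holds the per-epoch test accuracy under the key "val_accuracy".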
This example actually will not take too much memory given the small image inputs, but it will take a while to train with this learning rate so I suggest implementing the smaller version if you are truly interested in seeing how the network works!
Keras API - The Keras API page.
TensorFlow API - See this page on the tf.keras module and notice that it is equivalent to the Keras module! Feel free to use either. A bit less explanatory in some ways, especially for methods that are simply an interface on top of simple functions. (Much like how the activation functions have a functor interface.)
Keras Examples Repository - Implements A LOT of basic neural networks and even some more advanced ones. If you don't want to create the network from scratch, this is usually a good starting spot that is kept up to date with the most current version of Keras.
Residual Networks Behave as an Ensemble - Paper that describes how ResNet behaves very much like how a typical ensemble classifier functions.