In this part we will implement a full Recurrent Neural Network from scratch using Python and optimize our implementation using Theano, a library to perform operations on a GPU.

## LANGUAGE MODELING

Our goal is to build a Language Model using a Recurrent Neural Network. Here's what that means. Let's say we have a sentence of $m$ words. A language model allows us to predict the probability of observing the sentence (in a given dataset) as:

$$P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})$$

In words, the probability of a sentence is the product of the probabilities of each word given the words that came before it. So, the probability of the sentence "He went to buy some chocolate" would be the probability of "chocolate" given "He went to buy some", multiplied by the probability of "some" given "He went to buy", and so on.

Why is that useful? Why would we want to assign a probability to observing a sentence?

First, such a model can be used as a scoring mechanism. For example, a Machine Translation system typically generates multiple candidates for an input sentence. You could use a language model to pick the most probable sentence. Intuitively, the most probable sentence is likely to be grammatically correct. Similar scoring happens in speech recognition systems.

But solving the Language Modeling problem also has a cool side effect. Because we can predict the probability of a word given the preceding words, we are able to generate new text. It's a generative model: given a sequence of words we can sample the next word from the predicted probabilities, and repeat until we have a full sentence.

Note that in the above equation the probability of each word is conditioned on all previous words. In practice, many models can only look at a few of the previous words for computational reasons. RNNs can, in theory, capture such long-term dependencies, though in practice it is more difficult.

## TRAINING DATA AND PREPROCESSING

To train our language model we need text to learn from. Fortunately we don't need any labels to train a language model, just raw text. I downloaded 15,000 longish reddit comments from a dataset available on Google's BigQuery. Text generated by our model will (hopefully) sound like reddit commenters! But as with most Machine Learning projects we first need to do some pre-processing to get our data into the right format.
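Before moving on to preprocessing, here is a toy illustration of the chain-rule factorization from the language-modeling section above. The conditional probabilities are made up for illustration; a real model would predict them:

```python
import numpy as np

# Hypothetical conditionals P(w_i | w_1, ..., w_{i-1}) for the six words
# of "He went to buy some chocolate" -- the values are made up.
cond_probs = [0.1, 0.2, 0.5, 0.05, 0.3, 0.1]

# The sentence probability is the product of the conditionals.
sentence_prob = np.prod(cond_probs)

# Products of many small probabilities underflow quickly, so in practice
# we usually sum log-probabilities instead.
log_prob = np.sum(np.log(cond_probs))

print(sentence_prob)      # a very small number, about 1.5e-05
print(np.exp(log_prob))   # same value, computed in log space
```

Working in log space is what makes evaluating long sentences numerically stable.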
## 1. TOKENIZE TEXT

We have raw text, but we want to make predictions on a per-word basis. This means we must tokenize our text into sentences, and sentences into words. We could just split on spaces, but that wouldn't handle punctuation properly, so we use NLTK's `sent_tokenize` and `word_tokenize` functions to do the heavy lifting for us.

## 2. REMOVE INFREQUENT WORDS

Most words in our text will only appear one or two times. It's a good idea to remove these infrequent words. Having a huge vocabulary will make our model slow to train (we'll talk about why that is later), and because we don't have a lot of contextual examples for such words we wouldn't be able to learn how to use them correctly anyway. That's quite similar to how humans learn: to really understand how to appropriately use a word you need to have seen it in different contexts. In our code we limit our vocabulary to the `vocabulary_size` most common words (I set it to 8000), and replace all words not in the vocabulary by an `UNKNOWN_TOKEN`.

## 3. PREPEND SPECIAL START AND END TOKENS

We also want to learn which words tend to start and end a sentence. To do this we prepend a special `SENTENCE_START` token to each sentence and append a special `SENTENCE_END` token.

## 4. BUILD TRAINING DATA MATRICES

The input to our Recurrent Neural Network are vectors, not strings. So we create a mapping between words and indices, `index_to_word` and `word_to_index`. A training example `x` is then a sequence of word indices starting with `SENTENCE_START`, and its label `y` is the same sequence shifted by one position, ending with `SENTENCE_END`: at each position the goal is to predict the next word.

## BUILDING THE RNN

For a general overview of RNNs take a look at the first part of the tutorial.

*A recurrent neural network and the unfolding in time of the computation involved in its forward computation.*

Let's get concrete and see what the RNN for our language model looks like. The input `x` will be a sequence of words (just like the example above) and each $x_t$ is a single word. But there's one more thing: because of how matrix multiplication works we can't simply use a word index (like 36) as an input. Instead, we represent each word as a one-hot vector of size `vocabulary_size`. So each $x_t$ becomes a vector, and `x` becomes a matrix, with each row representing a word.

Let's recap the equations for the RNN from the first part of the tutorial:

$$s_t = \tanh(U x_t + W s_{t-1})$$
$$o_t = \mathrm{softmax}(V s_t)$$

I always find it useful to write down the dimensions of the matrices and vectors. Let's assume we pick a vocabulary size $C = 8000$ and a hidden layer size $H = 100$. You can think of the hidden layer size as the "memory" of our network. Making it bigger allows us to learn more complex patterns, but also results in additional computation. Then we have:

$$x_t \in \mathbb{R}^{8000} \quad o_t \in \mathbb{R}^{8000} \quad s_t \in \mathbb{R}^{100} \quad U \in \mathbb{R}^{100 \times 8000} \quad V \in \mathbb{R}^{8000 \times 100} \quad W \in \mathbb{R}^{100 \times 100}$$

This is valuable information.
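The dimensions above can be sanity-checked with a few lines of NumPy. This is only a shape check with arbitrarily initialized matrices, not the real model:

```python
import numpy as np

C, H = 8000, 100  # vocabulary size and hidden layer size from the text

U = np.random.uniform(-0.1, 0.1, (H, C))  # input-to-hidden
V = np.random.uniform(-0.1, 0.1, (C, H))  # hidden-to-output
W = np.random.uniform(-0.1, 0.1, (H, H))  # hidden-to-hidden

x_t = np.zeros(C)
x_t[36] = 1.0          # one-hot vector for the word with index 36
s_prev = np.zeros(H)   # previous hidden state

s_t = np.tanh(U.dot(x_t) + W.dot(s_prev))
o_t = V.dot(s_t)       # pre-softmax scores over the vocabulary

print(s_t.shape, o_t.shape)  # (100,) (8000,)

# Multiplying the one-hot x_t with U just selects column 36 of U.
assert np.allclose(U.dot(x_t), U[:, 36])
```

The final assertion is why an implementation never needs the full $U x_t$ product: indexing a column is enough.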
Remember that $U$, $V$ and $W$ are the parameters of our network we want to learn from data. Thus, we need to learn a total of $2HC + H^2$ parameters. In the case of $C = 8000$ and $H = 100$ that's 1,610,000. The dimensions also tell us the bottleneck of our model. Note that because $x_t$ is a one-hot vector, multiplying it with $U$ is essentially the same as selecting a column of $U$, so we don't need to perform the full multiplication. Then, the biggest matrix multiplication in our network is $V s_t$. That's why we want to keep our vocabulary size small if possible.

Armed with this, it's time to start our implementation.

## INITIALIZATION

We start by declaring an RNN class and initializing our parameters. I'm calling this class `RNNNumpy` because we will implement a Theano version later. A common recommendation is to initialize the parameters randomly in the interval $[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}]$, where $n$ is the number of incoming connections from the previous layer. Above, `word_dim` is the size of our vocabulary and `hidden_dim` is the size of our hidden layer.

## FORWARD PROPAGATION

Next, let's implement the forward propagation (predicting word probabilities) defined by our equations above. We not only return the calculated outputs, but also the hidden states. We will use them later to calculate the gradients, and by returning them here we avoid duplicate computation. Each $o_t$ is a vector of probabilities representing the words in our vocabulary, but sometimes, for example when evaluating our model, all we want is the next word with the highest probability. We call this function `predict`.

Let's try our newly implemented methods and see an example output. For each word in an example sentence, our model makes 8000 predictions representing the probabilities of the next word. Note that because we initialized $U$, $V$ and $W$ to random values these predictions are completely random right now. Calling `predict` gives the indices of the highest-probability predictions for each word.

## CALCULATING THE LOSS

To train our network we need a way to measure the errors it makes. We call this the loss function $L$, and our goal is to find the parameters $U$, $V$ and $W$ that minimize the loss function for our training data. A common choice for the loss function is the cross-entropy loss.
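The initialization and forward pass described above can be sketched as follows. This is a minimal NumPy version under the assumptions in the text (the `softmax` helper is our own; the class name `RNNNumpy` follows the text):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    e = np.exp(z - np.max(z))
    return e / e.sum()

class RNNNumpy:
    def __init__(self, word_dim, hidden_dim=100):
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        # Uniform initialization in [-1/sqrt(n), 1/sqrt(n)], where n is
        # the number of incoming connections.
        self.U = np.random.uniform(-1.0/np.sqrt(word_dim), 1.0/np.sqrt(word_dim),
                                   (hidden_dim, word_dim))
        self.V = np.random.uniform(-1.0/np.sqrt(hidden_dim), 1.0/np.sqrt(hidden_dim),
                                   (word_dim, hidden_dim))
        self.W = np.random.uniform(-1.0/np.sqrt(hidden_dim), 1.0/np.sqrt(hidden_dim),
                                   (hidden_dim, hidden_dim))

    def forward_propagation(self, x):
        T = len(x)
        # s holds all hidden states; the extra row s[-1] is the initial
        # all-zero state used at t = 0.
        s = np.zeros((T + 1, self.hidden_dim))
        o = np.zeros((T, self.word_dim))
        for t in range(T):
            # Indexing U by the word id is equivalent to multiplying U
            # with a one-hot input vector.
            s[t] = np.tanh(self.U[:, x[t]] + self.W.dot(s[t - 1]))
            o[t] = softmax(self.V.dot(s[t]))
        return o, s

    def predict(self, x):
        # Highest-probability word index at each time step.
        o, s = self.forward_propagation(x)
        return np.argmax(o, axis=1)

model = RNNNumpy(word_dim=8000)
o, s = model.forward_propagation([0, 51, 27])
print(o.shape)  # (3, 8000): one distribution over the vocabulary per word
```

Returning both `o` and `s` from `forward_propagation` is the design choice mentioned in the text: the hidden states are reused during backpropagation.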
If we have $N$ training examples (words in our text) and $C$ classes (the size of our vocabulary) then the loss with respect to our predictions $o$ and the true labels $y$ is given by:

$$L(y, o) = -\frac{1}{N} \sum_{n \in N} y_n \log o_n$$

The formula looks a bit complicated, but all it really does is sum over our training examples and add to the loss based on how off our predictions are. The further apart $y$ (the correct words) and $o$ (our predictions) are, the greater the loss will be. We implement the function `calculate_loss`.

Let's take a step back and think about what the loss should be for random predictions. That will give us a baseline and make sure our implementation is correct. We have $C$ words in our vocabulary, so each word should be (on average) predicted with probability $1/C$, which would yield a loss of $L = -\frac{1}{N} N \log \frac{1}{C} = \log C$. For $C = 8000$ that's about 8.987, and evaluating our randomly initialized model indeed gives a value close to that. Pretty close! Keep in mind that evaluating the loss on the full dataset is an expensive operation and can take hours if you have a lot of data!

## TRAINING THE RNN WITH SGD AND BACKPROPAGATION THROUGH TIME (BPTT)

Remember that we want to find the parameters $U$, $V$ and $W$ that minimize the total loss on the training data. The most common way to do this is SGD, Stochastic Gradient Descent. The idea behind SGD is pretty simple. We iterate over all our training examples and during each iteration we nudge the parameters in a direction that reduces the error. These directions are given by the gradients of the loss: $\frac{\partial L}{\partial U}$, $\frac{\partial L}{\partial V}$, $\frac{\partial L}{\partial W}$. SGD also needs a learning rate, which defines how big a step we take in each iteration.

But how do we calculate those gradients we mentioned above? In a traditional Neural Network we do this through the backpropagation algorithm. In RNNs we use a slightly modified version of this algorithm called Backpropagation Through Time (BPTT), because the parameters are shared by all time steps in the network.

## GRADIENT CHECKING

Whenever you implement backpropagation it is a good idea to also implement gradient checking, a way of verifying that your implementation is correct. The idea is that the derivative with respect to a parameter equals the slope of the loss at that point, which we can approximate by slightly nudging the parameter in both directions:

$$\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + h) - L(\theta - h)}{2h}$$

We then compare the gradient we calculated using backpropagation to the gradient we estimated with the method above. If there's no large difference we are good. The approximation needs to calculate the total loss for every single parameter, so gradient checking is very expensive (remember, we had more than a million parameters in the example above). It's a good idea to perform it on a model with a smaller vocabulary.
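The centered-difference approximation above can be sketched generically. This toy version checks the gradient of a simple quadratic loss rather than the RNN's, but the mechanism (perturb one parameter at a time, recompute the loss twice) is the same, which is also why the check is so expensive:

```python
import numpy as np

def numerical_gradient(loss_fn, theta, h=1e-4):
    # Centered-difference estimate of dL/dtheta, one entry at a time.
    # loss_fn must read the current (mutated) values of theta.
    grad = np.zeros_like(theta)
    it = np.nditer(theta, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        orig = theta[ix]
        theta[ix] = orig + h
        loss_plus = loss_fn()
        theta[ix] = orig - h
        loss_minus = loss_fn()
        theta[ix] = orig  # restore the parameter
        grad[ix] = (loss_plus - loss_minus) / (2 * h)
        it.iternext()
    return grad

# Toy check on L(theta) = sum(theta^2), whose true gradient is 2*theta.
theta = np.array([1.0, -2.0, 3.0])
approx = numerical_gradient(lambda: np.sum(theta**2), theta)
assert np.allclose(approx, 2 * theta, atol=1e-6)
```

Note that each parameter entry costs two full loss evaluations, which is why the text recommends running the check only on a model with a small vocabulary.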