Summary: I learn best with toy code that I can play with. This tutorial teaches backpropagation via a very simple toy example, a short Python implementation.

Edit: Some folks have asked about a followup article, and I'm planning to write one. I'll tweet it out when it's complete at @iamtrask. Feel free to follow if you'd be interested in reading it and thanks for all the feedback!

## Just Give Me The Code:

```python
import numpy as np

X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1
for j in range(60000):
    l1 = 1/(1+np.exp(-(np.dot(X,syn0))))
    l2 = 1/(1+np.exp(-(np.dot(l1,syn1))))
    l2_delta = (y - l2)*(l2*(1-l2))
    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1-l1))
    syn1 += l1.T.dot(l2_delta)
    syn0 += X.T.dot(l1_delta)
```

Other Languages: D

However, this is a bit terse… let's break it apart into a few simple parts.

## Part 1: A Tiny Toy Network

A neural network trained with backpropagation is attempting to use input to predict output. Consider trying to predict the output column given the three input columns. We could solve this problem by simply measuring statistics between the input values and the output values.
If we did so, we would see that the leftmost input column is perfectly correlated with the output.

## 2 Layer Neural Network:

```python
01. import numpy as np
02.
03. # sigmoid function
04. def nonlin(x,deriv=False):
05.     if(deriv==True):
06.         return x*(1-x)
07.     return 1/(1+np.exp(-x))
08.
09. # input dataset
10. X = np.array([  [0,0,1],
11.                 [0,1,1],
12.                 [1,0,1],
13.                 [1,1,1] ])
14.
15. # output dataset
16. y = np.array([[0,0,1,1]]).T
17.
18. # seed random numbers to make calculation
19. # deterministic (just a good practice)
20. np.random.seed(1)
21.
22. # initialize weights randomly with mean 0
23. syn0 = 2*np.random.random((3,1)) - 1
24.
25. for iter in range(10000):
26.
27.     # forward propagation
28.     l0 = X
29.     l1 = nonlin(np.dot(l0,syn0))
30.
31.     # how much did we miss?
32.     l1_error = y - l1
33.
34.     # multiply how much we missed by the
35.     # slope of the sigmoid at the values in l1
36.     l1_delta = l1_error * nonlin(l1,True)
37.
38.     # update weights
39.     syn0 += np.dot(l0.T,l1_delta)
40.
41. print("Output After Training:")
42. print(l1)
```

```
Output After Training:
[[ 0.00966449]
 [ 0.00786506]
 [ 0.99358898]
 [ 0.99211957]]
```

As you can see in the "Output After Training", it works!!! Before I describe the process, I recommend playing around with the code to get an intuitive feel for how it works. You should be able to run it "as is" in an ipython notebook (or a script if you must, but I HIGHLY recommend the notebook). Here are some good places to look in the code:

• Compare l1 after the first iteration and after the last iteration.
• Check out the "nonlin" function. This is what gives us a probability as output.
• Check out how l1_error changes as you iterate.
• Take apart line 36. Most of the secret sauce is here.
• Check out line 39. Everything in the network prepares for this operation.

Let's walk through the code line by line.

Recommendation: open this blog in two screens so you can see the code while you read it. That's kinda what I did while I wrote it. :)

Line 01: This imports numpy, which is a linear algebra library. This is our only dependency.

Line 04: This is our "nonlinearity". While it can be several kinds of functions, this nonlinearity is a function called a "sigmoid". A sigmoid function maps any value to a value between 0 and 1. We use it to convert numbers to probabilities. It also has several other desirable properties for training neural networks.

Line 05: Notice that this function can also generate the derivative of a sigmoid (when deriv=True). One of the desirable properties of a sigmoid function is that its output can be used to create its derivative. If the sigmoid's output is a variable "out", then the derivative is simply out * (1-out). This is very efficient.

Line 10: This initializes our input dataset as a numpy matrix. Each row is a single "training example". Each column corresponds to one of our input nodes. Thus, we have 3 input nodes to the network and 4 training examples.

Line 16: This initializes our output dataset. In this case, I generated the dataset horizontally (with a single row and 4 columns) for space. ".T" is the transpose function. After the transpose, this y matrix has 4 rows with one column. Just like our input, each row is a training example, and each column (only one) is an output node. So, our network has 3 inputs and 1 output.

Line 20: It's good practice to seed your random numbers. Your numbers will still be randomly distributed, but they'll be randomly distributed in exactly the same way each time you train. This makes it easier to see how your changes affect the network.
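If you want to convince yourself of the seeding claim on line 20, here is a minimal sketch. The helper name `init_weights` is hypothetical (it is not part of the network code above); it just reseeds NumPy and draws a (3,1) weight matrix the same way line 23 does.

```python
import numpy as np

def init_weights(seed):
    # Hypothetical helper for this check only: reseed NumPy's global
    # generator, then draw weights uniformly in [-1, 1) as on line 23.
    np.random.seed(seed)
    return 2 * np.random.random((3, 1)) - 1

first = init_weights(1)
second = init_weights(1)      # same seed as `first`
different = init_weights(2)   # a different seed

# Same seed gives identical "random" weights; a new seed gives new ones.
print(np.array_equal(first, second))
print(np.array_equal(first, different))
```

The weights still look random, but rerunning with the same seed reproduces them exactly, which is what makes before/after comparisons of your changes meaningful.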
Line 23: This is our weight matrix for this neural network. It's called "syn0" to imply "synapse zero". Since we only have 2 layers (input and output), we only need one matrix of weights to connect them. Its dimension is (3,1) because we have 3 inputs and 1 output. Another way of looking at it is that l0 is of size 3 and l1 is of size 1. Thus, we want to connect every node in l0 to every node in l1, which requires a matrix of dimensionality (3,1). :)

Line 25: This begins our actual network training code. This for loop "iterates" multiple times over the training code to optimize our network to the dataset.

Line 28: Since our first layer, l0, is simply our data, we explicitly describe it as such at this point. Remember that X contains 4 training examples (rows). We're going to process all of them at the same time in this implementation. This is known as "full batch" training. Thus, we have 4 different l0 rows, but you can think of it as a single training example if you want. It makes no difference at this point. (We could load in 1000 or 10,000 if we wanted to without changing any of the code.)

Line 29: This is our prediction step. Basically, we first let the network "try" to predict the output given the input. We will then study how it performs so that we can adjust it to do a bit better for each iteration.

Line 32: So, given that l1 had a "guess" for each input, we can now compare how well it did by subtracting the guess (l1) from the true answer (y). l1_error is just a vector of positive and negative numbers reflecting how much the network missed.

Line 36: Now we're getting to the good stuff! This is the secret sauce! There's a lot going on in this line, so let's further break it into two parts.

## First Part: The Derivative

```python
nonlin(l1,True)
```

If l1 represents these three dots, the code above generates the slopes of the lines below.
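Those slopes can also be computed directly. The sketch below assumes the three dots sit at x = -1.0, 0.0 and 2.0 (the values named in the text), reuses the tutorial's `nonlin`, and cross-checks the `out * (1 - out)` shortcut against a finite-difference estimate. The names `xs`, `outs`, `slopes`, `eps` and `numeric` are mine, not part of the original code.

```python
import numpy as np

def nonlin(x, deriv=False):
    # Same convention as the tutorial: with deriv=True, x is assumed to
    # already be a sigmoid OUTPUT, not a raw input.
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

xs = np.array([-1.0, 0.0, 2.0])  # assumed positions of the three dots
outs = nonlin(xs)                # sigmoid outputs (what l1 would hold)
slopes = nonlin(outs, True)      # slope of the sigmoid at each dot

# Cross-check with a central finite difference on the raw inputs.
eps = 1e-6
numeric = (nonlin(xs + eps) - nonlin(xs - eps)) / (2 * eps)

for x, out, slope in zip(xs, outs, slopes):
    print("x=%5.1f  sigmoid=%.4f  slope=%.4f" % (x, out, slope))
print(np.allclose(slopes, numeric))
```

Note that `nonlin(..., True)` is fed the sigmoid's output, not x itself. That is why line 36 passes l1 (already squashed) rather than the raw dot product.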
Notice that very high values such as x=2.0 (green dot) and very low values such as x=-1.0 (purple dot) have rather shallow slopes. The highest slope you can have is at x=0 (blue dot). This plays an important role. Also notice that all derivatives are between 0 and 1 (the sigmoid's slope never actually exceeds 0.25, its value at x=0).

## Entire Statement: The Error Weighted Derivative

```python
l1_delta = l1_error * nonlin(l1,True)
```

There are more "mathematically precise" ways to describe this than "The Error Weighted Derivative", but I think that this captures the intuition. l1_error is a (4,1) matrix. nonlin(l1,True) returns a (4,1) matrix. What we're doing is multiplying them "elementwise". This returns a (4,1) matrix l1_delta with the multiplied values.

Line 39: We are now ready to update our network! Let's take a look at a single training example.

In this training example, we're all set up to update our weights. Let's update the far left weight (9.5).

However, because we're using a "full batch" configuration, we're doing the above step on all four training examples. So, it looks a lot more like the image above. So, what does line 39 do? It computes the weight updates for each weight for each training example, sums them, and updates the weights, all in a single line. Play around with the matrix multiplication and you'll see it do this!

## Takeaways:

So, now that we've looked at how the network updates, let's look back at our training data and reflect. When both an input and an output are 1, we increase the weight between them. When an input is 1 and an output is 0, we decrease the weight between them.

Thus, in our four training examples below, the weight from the first input to the output would consistently increment or remain unchanged, whereas the other two weights would find themselves both increasing and decreasing across training examples (cancelling out progress). This phenomenon is what causes our network to learn based on correlations between the input and output.
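Both claims (that line 39 sums per-example updates, and that only the first weight gets pushed in a consistent direction) can be checked with a few lines. This is a sketch of one update step under the same data and seed as the 2 layer network; `batch_update`, `per_example` and `summed` are names I'm introducing for illustration.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0, 0, 1, 1]]).T

np.random.seed(1)
syn0 = 2 * np.random.random((3, 1)) - 1

l0 = X
l1 = nonlin(np.dot(l0, syn0))
l1_delta = (y - l1) * nonlin(l1, True)

# Full-batch update, computed the same way as line 39.
batch_update = np.dot(l0.T, l1_delta)

# The same quantity, one training example at a time: row i holds
# example i's contribution to each of the three weights.
per_example = l0 * l1_delta                      # (4,3) * (4,1) -> (4,3)
summed = per_example.sum(axis=0).reshape(3, 1)

print(per_example)
print(np.allclose(batch_update, summed))
```

In the printed matrix, the first column never goes negative (zero when the input is 0, positive when input and output are both 1), while the second and third columns contain both signs, so their contributions partly cancel when summed.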
## Part 2: A Slightly Harder Problem

Consider trying to predict the output column given the three input columns. A key takeaway should be that none of the columns has any correlation with the output. Each column has a 50% chance of predicting a 1 and a 50% chance of predicting a 0.

So, what's the pattern? It appears to be completely unrelated to column three, which is always 1. However, columns 1 and 2 give more clarity. If either column 1 or 2 is a 1 (but not both!) then the output is a 1. This is our pattern.

This is considered a "nonlinear" pattern because there isn't a direct one-to-one relationship between the input and output. Instead, there is a one-to-one relationship between a combination of inputs, namely columns 1 and 2, and the output.

Believe it or not, image recognition is a similar problem. If one had 100 identically sized images of pipes and bicycles, no individual pixel position would directly correlate with the presence of a bicycle or pipe. The pixels might as well be random from a purely statistical point of view. However, certain combinations of pixels are not random, namely the combination that forms the image of a bicycle or a pipe.

## Our Strategy

In order to first combine pixels into something that can then have a one-to-one relationship with the output, we need to add another layer. Our first layer will combine the inputs, and our second layer will then map them to the output using the output of the first layer as input. Before we jump into an implementation though, take a look at this table.

If we randomly initialize our weights, we will get hidden state values for layer 1. Notice anything? The second column (second hidden node) has a slight correlation with the output already! It's not perfect, but it's there. Believe it or not, this is a huge part of how neural networks train. (Arguably, it's the only way that neural networks train.) What the training below is going to do is amplify that correlation.
It's both going to update syn1 to map it to the output, and update syn0 to be better at producing it from the input!

Note: The field of adding more layers to model more combinations of relationships such as this is known as "deep learning" because of the increasingly deep layers being modeled.

## 3 Layer Neural Network:

```python
01. import numpy as np
02.
03. def nonlin(x,deriv=False):
04.     if(deriv==True):
05.         return x*(1-x)
06.
07.     return 1/(1+np.exp(-x))
08.
09. X = np.array([[0,0,1],
10.             [0,1,1],
11.             [1,0,1],
12.             [1,1,1]])
13.
14. y = np.array([[0],
15.             [1],
16.             [1],
17.             [0]])
18.
19. np.random.seed(1)
20.
21. # randomly initialize our weights with mean 0
22. syn0 = 2*np.random.random((3,4)) - 1
23. syn1 = 2*np.random.random((4,1)) - 1
24.
25. for j in range(60000):
26.
27.     # Feed forward through layers 0, 1, and 2
28.     l0 = X
29.     l1 = nonlin(np.dot(l0,syn0))
30.     l2 = nonlin(np.dot(l1,syn1))
31.
32.     # how much did we miss the target value?
33.     l2_error = y - l2
34.
35.     if (j% 10000) == 0:
36.         print("Error:" + str(np.mean(np.abs(l2_error))))
37.
38.     # in what direction is the target value?
39.     # were we really sure? if so, don't change too much.
40.     l2_delta = l2_error*nonlin(l2,deriv=True)
41.
42.     # how much did each l1 value contribute to the l2 error (according to the weights)?
43.     l1_error = l2_delta.dot(syn1.T)
44.
45.     # in what direction is the target l1?
46.     # were we really sure? if so, don't change too much.
47.     l1_delta = l1_error * nonlin(l1,deriv=True)
48.
49.     syn1 += l1.T.dot(l2_delta)
50.     syn0 += l0.T.dot(l1_delta)
```

```
Error:0.496410031903
Error:0.00858452565325
Error:0.00578945986251
Error:0.00462917677677
Error:0.00395876528027
Error:0.00351012256786
```

Recommendation: open this blog in two screens so you can see the code while you read it. That's kinda what I did while I wrote it. :)

Everything should look very familiar! It's really just two copies of the previous implementation stacked on top of each other. The output of the first layer (l1) is the input to the second layer. The only new thing happening here is on line 43.

Line 43: This uses the "confidence weighted error" from l2 to establish an error for l1. To do this, it simply sends the error across the weights from l2 to l1. This gives what you could call a "contribution weighted error" because we learn how much each node value in l1 "contributed" to the error in l2. This step is called "backpropagating" and is the namesake of the algorithm. We then update syn0 using the same steps we did in the 2 layer implementation.

## Part 3: Conclusion and Future Work

## My Recommendation:

If you're serious about neural networks, I have one recommendation. Try to rebuild this network from memory. I know that might sound a bit crazy, but it seriously helps. If you want to be able to create arbitrary architectures based on new academic papers or read and understand sample code for these different architectures, I think that it's a killer exercise. I think it's useful even if you're using frameworks like Torch, Caffe, or Theano. I worked with neural networks for a couple years before performing this exercise, and it was the best investment of time I've made in the field (and it didn't take long).

## Future Work

This toy example still needs quite a few bells and whistles to really approach the state-of-the-art architectures.
Here are a few things you can look into if you want to further improve your network. (Perhaps I will in a followup post.)

• Alpha |