Most "Backprop explained" articles seem to focus on a 3 layer perceptron and get into arcane details of the math of matrix multiplication differentiation. While details are great in some situation, I think a pedagogically wholesome approach might be to look at the big picture, which is asking the question "how to backprop through an arbitrary node"?
Thanks to Rajeev for being a patient sounding board.
A training step has the purely functional "forward" part, then the "backward" part, and then the state-changing optimizer, which updates weights. Weights (and inputs) are "leaf" nodes in the backprop graph, i.e., backprop ends at that point. Anything else is expected to continue the good work and backpropagate. At the weights, we are typically interested in finding dL/dw, where L is the (typically scalar) loss of the network. The optimizer then uses dL/dw in some way to update the weights. The main workhorse of backprop is of course the chain rule, and we'll get to that in a minute.
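For instance, plain vanilla SGD (just one of many possible optimizers) uses dL/dw directly; the learning rate below is an arbitrary value picked for illustration:

learning_rate = 0.01
w = w - learning_rate * dL_dw   # step the weight a little against the gradient of the loss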
Let us define an arbitrary node f in a graph. It can have any number of inputs and any number of outputs. Let us denote it as:
y = f(x)
y and x are in bold because each of them can contain multiple elements. Maybe x is composed of 3 things: 2 scalars and 1 vector.
We can expand it and write it as:
y0, y1, ..., yn = f(x0, x1, ..., xm)
When we start backprop for this arbitrary node f, most modern deep learning frameworks will make available to us the derivative wrt the outputs of the node, namely:
dL/dy, which is short form for dL/dy0, dL/dy1, ..., dL/dyn
dL/dy is available to us because the nodes that y fed into did their part and "backprop"ed those values to us.
Now we need to find dL/dx, which is short form for dL/dx0, dL/dx1, ..., dL/dxm
What is done with dL/dx is not our concern right now. Maybe it propagates further, maybe one of the x's is a weight and the optimizer uses it to update that weight.
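To make this contract concrete, here is a minimal sketch using PyTorch's custom autograd Function interface. The node itself (y0 = x0 * x1, y1 = x0 + x1) is invented purely for illustration; the key point is the shape of backward: it receives dL/dy for every output and must return dL/dx for every input.

import torch

class MyNode(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x0, x1):
        # purely functional forward: compute all outputs from all inputs
        ctx.save_for_backward(x0, x1)
        return x0 * x1, x0 + x1               # y0, y1

    @staticmethod
    def backward(ctx, dL_dy0, dL_dy1):
        # the framework hands us dL/dy0 and dL/dy1 ...
        x0, x1 = ctx.saved_tensors
        # ... and expects dL/dx0 and dL/dx1 in return
        dL_dx0 = dL_dy0 * x1 + dL_dy1 * 1.0   # dy0/dx0 = x1, dy1/dx0 = 1
        dL_dx1 = dL_dy0 * x0 + dL_dy1 * 1.0   # dy0/dx1 = x0, dy1/dx1 = 1
        return dL_dx0, dL_dx1

# usage: y0, y1 = MyNode.apply(x0, x1)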
Here we can make a simplification. We can imagine, WLOG, that f can be split into f0, f1, ..., fn, where each fi is a single-output node:
y0 = f0(x0, x1,... xm)
y1 = f1(x0, x1,... xm)
...
yn = fn(x0, x1,... xm)
Now all we need to do is find dL/dxi for each of these single-output nodes. Chain rule to the rescue: the contribution from fj is dL/dyj * dyj/dxi, and (as noted below with the "fork" rule) the contributions from the different fj's are added up to get dL/dxi.
The accompanying figure shows the decomposition of a multi-output node into multiple nodes with single outputs.
One last thing: when an output feeds two different nodes, then during backprop the gradients are added up. So if we have y = f(x), a = g(y), b = h(y), and L is some function of a and b, then dL/dy = dL/da * da/dy + dL/db * db/dy. Note that, among other places, this "fork" rule is needed in the above decomposition of a single multi-output node into multiple single-output nodes: each input xi feeds every one of the fj's, so its gradient contributions are summed.
Consider the following:
c = sq(x)
a = cube(c)
b = sq(c)
L = a - b
All in all, we have L(x) = x^6 - x^4 (the graph is shown in the accompanying figure). Simple differentiation yields dL/dx = 6x^5 - 4x^3. Let's now do it using backprop.
dL/da = d(a-b)/da = 1
dL/db = d(a-b)/db = -1
dL/dc = dL/da * da/dc + dL/db * db/dc
= 1 * d(c^3)/dc - d(c^2)/dc (because a = c^3 and b = c^2)
= 3c^2 - 2c
dL/dx = dL/dc * dc/dx
= (3c^2 - 2c) * d(x^2)/dx
= (3x^4 - 2x^2) * 2x
= 6x^5 - 4x^3
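We can sanity-check this with a few lines of plain Python, running the forward pass and then the backward pass by hand at, say, x = 2 (the value is arbitrary):

x = 2.0

# forward
c = x ** 2          # c = sq(x)
a = c ** 3          # a = cube(c)
b = c ** 2          # b = sq(c)
L = a - b

# backward, in reverse order through the graph
dL_da = 1.0
dL_db = -1.0
dL_dc = dL_da * 3 * c ** 2 + dL_db * 2 * c   # gradients from a and b add up at c
dL_dx = dL_dc * 2 * x                        # chain rule through c = sq(x)

print(dL_dx)                     # 160.0
print(6 * x ** 5 - 4 * x ** 3)   # 160.0, matches the analytic derivative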
Consider this graph:
a = f(x)
b = g(x)
L = a * b
We wish to find dL/dx
dL/da = d(a*b)/da = b
dL/db = d(a*b)/db = a
dL/dx = dL/da * da/dx + dL/db * db/dx ("forking" rule)
= b*da/dx + a*db/dx
This, of course, is the famous product rule.
You can do the same exercise for L = f(x) / g(x) and you should end up with the familiar quotient rule.
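Here is the same reasoning checked numerically for the product case, with f = sin and g = exp picked arbitrarily just to have something concrete to differentiate:

import math

x = 1.3

# forward: a = f(x), b = g(x), L = a * b
a = math.sin(x)
b = math.exp(x)

# backward
dL_da = b                   # d(a*b)/da
dL_db = a                   # d(a*b)/db
da_dx = math.cos(x)
db_dx = math.exp(x)
dL_dx = dL_da * da_dx + dL_db * db_dx        # "forking" rule at x

# product rule applied directly to sin(x) * exp(x)
direct = math.cos(x) * math.exp(x) + math.sin(x) * math.exp(x)
print(dL_dx, direct)        # the two values agree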
Consider a simple "mux" or "if-else" node: y = f(b, m, n), where y is set to m if b is true, else it is set to n.
When backpropping, we have dL/dy. How do we compute dL/dm or dL/dn? Branches of this sort are important in many cases; maxpooling is a very commonly used example from this category.
Another way to look at this is to replace the mux with an expression such as y = b * m + (1 - b) * n, treating b as a 0/1 value. Normal backprop principles apply to this equation:
dL/dm = b * dL/dy
dL/dn = (1 - b) * dL/dy
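In code, the mask view of the mux looks like this (a numpy sketch; here b is a 0/1 mask of the same shape as m and n, and all the values are made up):

import numpy as np

b = np.array([1.0, 0.0, 1.0])    # the condition, as a 0/1 mask
m = np.array([2.0, 3.0, 4.0])
n = np.array([5.0, 6.0, 7.0])

# forward: pick m where b is 1, n where b is 0
y = b * m + (1 - b) * n

# backward: route the incoming gradient to whichever input was selected
dL_dy = np.array([0.1, 0.2, 0.3])
dL_dm = b * dL_dy          # only the selected positions of m receive gradient
dL_dn = (1 - b) * dL_dy    # the rest flows to n
# (b is a discrete condition, not a trainable tensor, so dL/db is not computed)

Maxpooling backprops the same way: a mask records which input won the max, and only that input receives the gradient.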
A common technique in deep learning involves freezing the weights of lower layers (maybe to do fine-tuning). Of course, if certain weights do not need dL/dw computed for them, that is less work when running the network. Therefore, presumably the deep learning framework would have a pass that figures out how much of the backprop graph needs to be run. Consider an arbitrary tensor t in the middle of the graph. We need to compute dL/dt if and only if t lies on some path from a weight that needs to be updated to the loss L. In some more detail:
Let W denote the set of all weights for which we need to compute dL/dw. Then, for each w in W, find all paths from w to the loss L. All tensors on these paths will need their dL/dt computed. After this analysis, any leftover tensor t' will not need dL/dt' computed, and when we encounter such tensors, we can skip backpropping further.
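Here is a sketch of that analysis in Python, assuming the graph is available as two adjacency maps; the names (consumers, producers, tensors_needing_grad) are made up for illustration. A tensor needs dL/dt exactly when it is both downstream of some trainable weight and upstream of the loss, i.e., when it lies on some w-to-L path.

def tensors_needing_grad(consumers, producers, weights_to_update, loss):
    # consumers[t]: tensors computed directly from t (forward edges)
    # producers[t]: tensors that t is computed directly from (backward edges)

    def reachable(starts, edges):
        seen, stack = set(), list(starts)
        while stack:
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            stack.extend(edges.get(t, []))
        return seen

    downstream_of_w = reachable(weights_to_update, consumers)  # forward-reachable from some w in W
    upstream_of_loss = reachable([loss], producers)            # everything the loss depends on
    # tensors on some w -> L path are exactly the intersection
    return downstream_of_w & upstream_of_loss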
It seems a lot of ground can be covered by just these 3 basic "rules":
1. Multi-output nodes can be split into multiple single-output nodes
2. If an output goes to multiple inputs, add up the gradients
3. Chain rule
Rule 2 has strong correspondences with "parallel connections" and sum rules, and rule 3 with "serial connections" and algebraic product rules: in the way that enums/OR gates/sufficient conditions are SUM-like, and structs/AND gates/necessary conditions are PRODUCT-like.
These basic rules give rise to the familiar product and quotient rules, and they form the basis of backprop in modern deep learning frameworks. Once we have understood this, it makes sense to focus on the linear algebra of backpropping through a fully connected layer, or on the usage of dL/dw in optimizers, which are related but separate topics.
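To see the three rules working together, here is a minimal, self-contained sketch of a scalar reverse-mode autograd engine in Python (a toy illustration of my own, not any particular framework). Each Value records its parents and the local derivatives toward them; backward() walks the graph in reverse, applying the chain rule and accumulating gradients at forks.

class Value:
    """A scalar node in the graph; grad will hold dL/d(this value)."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0                  # rule 2: contributions from all consumers add up here
        self._parents = parents
        self._local_grads = local_grads  # d(this output)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # build a topological order (parents before children), then walk it in
        # reverse so every node is processed only after all of its consumers
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0                  # dL/dL = 1
        for v in reversed(order):
            for parent, local in zip(v._parents, v._local_grads):
                parent.grad += v.grad * local   # rule 3 (chain rule) + rule 2 (accumulate)

# the earlier example: c = sq(x), a = cube(c), b = sq(c), L = a - b
x = Value(2.0)
c = x * x
a = c * c * c
b = c * c
L = a - b
L.backward()
print(x.grad)                   # 160.0, i.e. 6x^5 - 4x^3 at x = 2

Rule 1 shows up implicitly: every op here is already a single-output node, which is exactly the simplified form that the splitting rule buys us.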