When training a neural network by stochastic gradient descent (i.e. one example at a time), all the values in the network become known during the forward pass. The value of every derivative becomes known, and of course all the weights are known.
A multi-layer linear system can be viewed as an equivalent simple, single-layer linear system.
The forward pass defines a simple linear system (with all numerical values known) that the backward pass will update.
Therefore the backward pass is only updating a simple linear system, an easy task, which it does by updating the weights in each layer.
The non-linear effect of the weight updates will only be seen later when further training examples are presented.
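For concreteness, here is a minimal numpy sketch (the two-layer ReLU network, the squared-error loss, and all variable names are assumptions for illustration) of that point: once the forward-pass values are recorded, each layer's gradient is just an outer product of already-known numbers, i.e. the update of a plain linear system.

```python
# Minimal sketch, assuming a tiny 2-layer ReLU network and squared-error loss.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny network: 4 inputs -> 8 hidden (ReLU) -> 3 outputs.
W1 = rng.normal(size=(8, 4)) * 0.5
W2 = rng.normal(size=(3, 8)) * 0.5

x = rng.normal(size=4)          # one training input
t = rng.normal(size=3)          # its target

# Forward pass: every value below is now a known number.
z1 = W1 @ x                     # pre-activations, layer 1
m1 = (z1 > 0).astype(float)     # ReLU derivative (0/1 mask), also known
h1 = m1 * z1                    # ReLU output
y  = W2 @ h1                    # network output

# Backward pass using only those recorded numbers.
# Squared-error loss L = 0.5 * ||y - t||^2.
d_y  = y - t                    # dL/dy
g_W2 = np.outer(d_y, h1)        # dL/dW2: outer product of known vectors
d_h1 = W2.T @ d_y
g_W1 = np.outer(d_h1 * m1, x)   # dL/dW1: again an outer product of knowns

# With the mask m1 frozen, the map x -> y is linear: y = (W2 * m1) @ W1 @ x.
# The gradient of that frozen linear system w.r.t. W1 matches g_W1 exactly.
W2_masked = W2 * m1             # scale each column of W2 by its mask value
g_W1_linear = np.outer(W2_masked.T @ d_y, x)
print(np.allclose(g_W1, g_W1_linear))   # True
```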
Viewing the weighted sums in each layer as an associative memory, an update to the weights for a single training pair adds a small amount of Gaussian noise to the recall responses of all the other stored training pairs.
The next training example will therefore see a small amount of Gaussian noise contaminating each weighted-sum (associative memory) recall, and that noise will be amplified by the non-linear activation functions present in the network.
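A small numpy sketch of that claim (the single linear layer, the dimensions, and the random data are illustrative assumptions): one rank-one SGD update for pair 0 shifts the recall of every other stored pair by an amount proportional to the dot product of its input with pair 0's input, which for high-dimensional, roughly orthogonal inputs behaves like faint Gaussian noise.

```python
# Sketch, assuming random stored pairs and a single linear layer as the memory.
import numpy as np

rng = np.random.default_rng(1)
d, n_pairs, lr = 256, 50, 0.01

X = rng.normal(size=(n_pairs, d)) / np.sqrt(d)   # stored input patterns
T = rng.normal(size=(n_pairs, 1))                # stored responses
W = rng.normal(size=(1, d)) * 0.01               # the "associative memory"

recall_before = X @ W.T                          # recall of every stored pair

# One SGD step on pair 0 only: a rank-one (outer product) update.
err0 = (W @ X[0]) - T[0]
W -= lr * np.outer(err0, X[0])

recall_after = X @ W.T
disturbance = (recall_after - recall_before)[1:]  # change seen by other pairs

print("mean disturbance:", disturbance.mean())
print("std  disturbance:", disturbance.std())
# The disturbance for pair k is -lr * err0 * (x_0 . x_k): near zero mean and
# small spread, i.e. the update looks like faint noise to the other pairs.
```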
The hope is that those non-linear effects will be small enough not to destabilize the network during training, which is generally the case.
For example, it is very helpful that ReLU switches at zero rather than at some other value; otherwise small amounts of Gaussian noise would be amplified by abrupt changes in the response of the activation function.
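A tiny sketch of that point (the shifted-switch activation below is a hypothetical foil, not a standard function): the same ±0.01 of noise around the switch point moves the ReLU output by about 0.01, but moves the output of a unit that switches at 1 by about 1.

```python
# Sketch comparing ReLU with a hypothetical activation that switches at t != 0.
import numpy as np

def relu(x):
    # Switches at zero: the output is continuous through the switch point.
    return np.where(x > 0.0, x, 0.0)

def shifted_switch(x, t=1.0):
    # Hypothetical unit that passes x through only above t, else outputs 0.
    # It has a jump of size t at the switch point.
    return np.where(x > t, x, 0.0)

noise = np.array([-0.01, +0.01])        # small, noise-scale perturbations

# Perturb an input sitting exactly at each unit's switch point.
print(relu(0.0 + noise))                # [0.    0.01]  -> change of ~0.01
print(shifted_switch(1.0 + noise))      # [0.    1.01]  -> change of ~1.01
```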
You can see then that a neural network learns by linear system updates.
The non-linear effects are deferred till later. The cycle is:
Linear update.
Small non-linear perturbation of all other training example pairs (by the added Gaussian noise interacting with the non-linearity).
Linear update.
Small non-linear perturbation of all other training example pairs.
......
Which is really just annealing, plain and simple.
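The cycle above can be seen in a minimal sketch (the toy two-layer ReLU network, the random data, and the learning rate are all assumptions for illustration): each step is a purely linear update for the presented example, and its side effect is a small shift in the network's response to every other example, which is where the non-linearity comes back in.

```python
# Sketch of the cycle: linear update for one example, small perturbation of
# the network's outputs on all the other examples.
import numpy as np

rng = np.random.default_rng(2)
d_in, d_hid, n = 16, 32, 20
X = rng.normal(size=(n, d_in))
T = rng.normal(size=(n, 1))
W1 = rng.normal(size=(d_hid, d_in)) * 0.1
W2 = rng.normal(size=(1, d_hid)) * 0.1
lr = 0.01

def forward(x, W1, W2):
    h = np.maximum(0.0, W1 @ x)           # ReLU hidden layer
    return W2 @ h, h

for step in range(5):
    i = step % n                           # present one example at a time (SGD)
    others = [k for k in range(n) if k != i]
    before = np.array([forward(X[k], W1, W2)[0] for k in others])

    # Linear update: gradients are outer products of known forward-pass values.
    y, h = forward(X[i], W1, W2)
    d_y = y - T[i]                         # squared-error gradient at the output
    d_h = (W2.T @ d_y) * (h > 0)           # backprop through the frozen ReLU mask
    W2 -= lr * np.outer(d_y, h)
    W1 -= lr * np.outer(d_h, X[i])

    # Small non-linear perturbation of all other training pairs.
    after = np.array([forward(X[k], W1, W2)[0] for k in others])
    print(f"step {step}: max shift on other examples = {np.abs(after - before).max():.4f}")
```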