Custom Layer

A softmax layer maps its input to an output vector in which each element gives the probability of the input belonging to one class, as the following figure shows.

However, these probabilities can be very small. Logsoftmax takes the log() of the softmax output vector, which turns very small probabilities into negative numbers of moderate magnitude. Each element of the Logsoftmax output can still be read as the score of one class.

Logsoftmax may be used to prevent underflow.

At forward time we have (with x = input vector, y = output vector, f = logsoftmax, i = i-th component):

yi = f(x)i
   = log( exp(xi) / sum_j(exp(xj)) )
   = xi - log( sum_j(exp(xj)) )
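
The last line of this derivation can be evaluated directly, without forming the softmax probabilities first. The sketch below is plain Python rather than the API of any particular framework, and the function names are hypothetical. Subtracting max(x) before exponentiating is a common stabilisation trick assumed here; it is not part of the derivation above, and it leaves the result mathematically unchanged.

```python
import math

def naive_log_of_softmax(x):
    # Compute the softmax probabilities first, then take their log.
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    probs = [e / total for e in exps]
    # math.log(0.0) raises ValueError, so report an underflowed probability as -inf.
    return [math.log(p) if p > 0.0 else float("-inf") for p in probs]

def log_softmax(x):
    # y_i = x_i - log(sum_j exp(x_j)), with x shifted by max(x) (assumed trick)
    # so that every exp() argument is <= 0 and cannot overflow.
    m = max(x)
    log_sum = m + math.log(sum(math.exp(v - m) for v in x))
    return [v - log_sum for v in x]

scores = [0.0, -1000.0]
naive = naive_log_of_softmax(scores)   # second probability underflows to 0.0
stable = log_softmax(scores)           # stays finite
print(naive, stable)
```

For these scores the naive route yields [0.0, -inf], because exp(-1000) underflows to zero, while the direct formula yields [0.0, -1000.0].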

The Jacobian Jf of f has entries dyi/dxk. On the diagonal of the i-th row:

dyi/dxi = 1 - exp(xi) / sum_j(exp(xj))

And for k different from i:

dyi/dxk = - exp(xk) / sum_j(exp(xj))

This gives for Jf:

1-E(x1)     -E(x2)     -E(x3)    ...
 -E(x1)    1-E(x2)     -E(x3)    ...
 -E(x1)     -E(x2)    1-E(x3)    ...

With E(xi) = exp(xi) / sum_j(exp(xj)), i.e. the i-th softmax output.
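
As a rough illustration, the matrix above can be built explicitly for a small input. This is a sketch in plain Python with hypothetical function names; the max(x) shift inside softmax is an assumed stabilisation step that does not change E(xi).

```python
import math

def softmax(x):
    # E(x_i) = exp(x_i) / sum_j exp(x_j), shifted by max(x) for stability (assumed).
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax_jacobian(x):
    # Jf[i][k] = dy_i/dx_k = (1 if i == k else 0) - E(x_k)
    e = softmax(x)
    n = len(x)
    return [[(1.0 if i == k else 0.0) - e[k] for k in range(n)] for i in range(n)]

x = [1.0, 2.0, 3.0]
J = log_softmax_jacobian(x)
row_sums = [sum(row) for row in J]   # each row sums to 1 - sum_k E(x_k)
for row in J:
    print(row)
```

Since the E(xk) sum to 1, every row of Jf sums to zero up to rounding, which is a quick sanity check on the entries.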

So, if we call gradInput the gradient w.r.t. the input and gradOutput the gradient w.r.t. the output, backpropagation gives (chain rule):

gradInputi = sum_j( gradOutputj . dyj/dxi )

This is equivalent to:

gradInput = transpose(Jf) . gradOutput
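
Because dyj/dxi = (1 if i = j, else 0) - E(xi), the product transpose(Jf) . gradOutput reduces to gradInputi = gradOutputi - E(xi) . sum_j(gradOutputj), so the full Jacobian never has to be stored. The sketch below is plain Python with hypothetical function names, not the code of any specific library. It applies that reduced form and compares it with a central finite-difference estimate, where the step size eps is an arbitrary choice.

```python
import math

def log_softmax(x):
    # y_i = x_i - log(sum_j exp(x_j)), shifted by max(x) for stability (assumed).
    m = max(x)
    log_sum = m + math.log(sum(math.exp(v - m) for v in x))
    return [v - log_sum for v in x]

def log_softmax_backward(x, grad_output):
    # gradInput_i = gradOutput_i - E(x_i) * sum_j gradOutput_j, with E(x_i) = exp(y_i)
    e = [math.exp(v) for v in log_softmax(x)]
    g_sum = sum(grad_output)
    return [g - p * g_sum for g, p in zip(grad_output, e)]

def numerical_grad_input(x, grad_output, eps=1e-6):
    # Central differences on the scalar sum_j gradOutput_j * y_j(x).
    grads = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += eps
        xm[i] -= eps
        lp = sum(g * y for g, y in zip(grad_output, log_softmax(xp)))
        lm = sum(g * y for g, y in zip(grad_output, log_softmax(xm)))
        grads.append((lp - lm) / (2 * eps))
    return grads

x = [0.5, -1.0, 2.0]
grad_output = [0.1, -0.3, 0.7]
analytic = log_softmax_backward(x, grad_output)
numeric = numerical_grad_input(x, grad_output)
print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places, and the components of gradInput sum to zero up to rounding, since the E(xi) sum to 1.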