Suppose a ground-truth probability distribution $[p_1, p_2, \ldots]$ whose entries sum to one, and let $\hat{p}_i$ denote the predicted probability for each class. The cross entropy is calculated as
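$$H(p, \hat{p}) = -\sum_{i} p_i \log \hat{p}_i$$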
For a one-hot probability distribution, i.e. when exactly one entry of the ground-truth probability vector equals one while all others are zero, the categorical cross entropy can be simplified as
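$$L = -\log \hat{p}_c$$

where $c$ is the index of the class whose ground-truth probability equals one.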
For a binary classification problem, the ground-truth label $y$ equals either 1 or 0, which is equivalent to the distribution vector $[y, 1-y]$. Likewise, the prediction $\hat{y}$ is equivalent to the distribution vector $[\hat{y}, 1-\hat{y}]$. Thus the binary cross entropy can be calculated as
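$$L = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big)$$

To make the formula concrete, below is a minimal NumPy sketch of the binary cross entropy averaged over a batch; the function name, the `eps` clipping guard, and the example numbers are illustrative assumptions rather than part of the original text.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy over a batch of probabilities (illustrative sketch)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Example: ground-truth labels y and predicted probabilities \hat{y}
y_true = np.array([1, 0, 1])
y_pred = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y_true, y_pred))
```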