Let's say we have a network for N-way classification, which has N outputs y[0], ..., y[N-1] over which a softmax function is applied. We'll also represent the labels as one-hot vectors with components t[0], ..., t[N-1]: for each example, exactly one of these is 1 and the others are 0. Then the cross-entropy is defined as:

    CE(y, t) = -sum(t[i] * log(y[i]), i = 0 .. N-1)

This can also be interpreted as a negative log-likelihood, or as a KL divergence between the output distribution and the target distribution; all of these amount to the same thing. Of course, the binary cross-entropy is just a special case of the categorical cross-entropy when the number of classes is 2.

The reason for the confusion, I think, is that in a binary classification problem you usually build a network with a single sigmoid output, where 0 means one class and 1 means the other. You can do this because p(classA) = 1 - p(classB), so it's pointless to have the network compute both. The binary cross-entropy is therefore usually defined in terms of a single output y:

    CE(y, t) = -t * log(y) - (1 - t) * log(1 - y)

The problems start when you apply this definition of the binary cross-entropy elementwise to each output of an N-way classifier where N > 2. That just... doesn't make sense, and it definitely isn't equivalent to the proper way to compute the cross-entropy in that case. (Note that for N = 2 it actually gives you two times the cross-entropy, so everything would still work correctly in that case, apart from the updates being twice as large as they should be.)
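To make this concrete, here is a small NumPy sketch (mine, not from the original post) that computes both quantities on a softmax output. For N = 2 the elementwise binary cross-entropy comes out to exactly twice the categorical cross-entropy, as claimed above; for N > 2 the two quantities are simply different:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def categorical_ce(y, t):
    # CE(y, t) = -sum_i t[i] * log(y[i])
    return -np.sum(t * np.log(y))

def elementwise_binary_ce(y, t):
    # Binary CE applied (incorrectly, for N > 2) to each output.
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# N = 2: the elementwise version is exactly 2x the categorical one.
y2 = softmax(np.array([1.5, -0.5]))
t2 = np.array([1.0, 0.0])
print(elementwise_binary_ce(y2, t2) / categorical_ce(y2, t2))  # ~2.0

# N = 3: the two quantities are no longer proportional.
y3 = softmax(np.array([2.0, 0.0, -1.0]))
t3 = np.array([1.0, 0.0, 0.0])
print(categorical_ce(y3, t3), elementwise_binary_ce(y3, t3))
```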
I accidentally did this once (in Theano it's a matter of using T.nnet.binary_crossentropy instead of T.nnet.categorical_crossentropy, an easy mistake to make) and it's a pretty tough bug to track down :)

-----------------------------------------------------------------------------------------------------------------------

It's my understanding that the 'logits' should be a Tensor of probabilities, each one corresponding to a certain pixel's probability that it is part of an image that will ultimately be a "dog" or a "truck" or whatever... a finite number of things. These logits will get plugged into this cross-entropy equation:

    H(p, q) = -sum(p(x) * log(q(x)), over all x)

As I understand it, the logits are plugged into the right side of the equation: that is, they are the q of every x (image). If they were probabilities from 0 to 1... that would make sense to me. But when I'm running my code and ending up with a tensor of logits, I'm not getting probabilities. Instead I'm getting floats that are both positive and negative:
So my question is... is that right? Do I have to somehow calculate all my logits and turn them into probabilities from 0 to 1?

>> The crucial thing to note is that tf.nn.softmax_cross_entropy_with_logits() expects unscaled logits: it applies the softmax to them internally. Therefore, the logits do not need to be probabilities, and the positive and negative floats you are seeing are exactly what the function expects. An alternative way to write the same computation would be to apply tf.nn.softmax() to the logits yourself and then compute the cross-entropy by hand on the resulting probabilities. However, this alternative would be (i) less numerically stable (since the softmax may compute much larger values) and (ii) less efficient (since some redundant computation would happen in the backprop). For real uses, we recommend that you use tf.nn.softmax_cross_entropy_with_logits(). --------------------------------------------------------------------------------------------------------------------------
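As a rough sketch of the difference between the two forms (NumPy rather than TensorFlow, so it runs anywhere; this is my illustration of what a fused softmax-cross-entropy op computes, not the library's actual implementation), the fused version works in log space via the log-sum-exp trick and never exponentiates large values:

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def xent_two_step(logits, labels):
    # The "alternative" form: explicit softmax, then cross-entropy.
    return -np.sum(labels * np.log(softmax(logits)), axis=-1)

def xent_fused(logits, labels):
    # What a fused op computes: log-softmax via log-sum-exp,
    # avoiding the intermediate probabilities entirely.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -np.sum(labels * log_softmax, axis=-1)

logits = np.array([[2.0, -1.0, 0.5]])   # unscaled scores; mixed signs are fine
labels = np.array([[1.0, 0.0, 0.0]])    # one-hot target
print(xent_two_step(logits, labels), xent_fused(logits, labels))  # both ~0.241
```

Both give the same loss here; the fused form only pays off numerically once the logits get large enough for exp() to overflow in the two-step version.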
But when changing my ...

A: The original MNIST example uses a one-hot encoding to represent the labels in the data: this means that if there are NLABELS classes, each label is a length-NLABELS vector with a 1 in the position of the true class and 0 everywhere else.

There are (at least) two approaches you could try for binary classification:

1. The simplest would be to set `NLABELS = 2` for the two possible classes, and encode your training data as `[1 0]` for label 0 and `[0 1]` for label 1. This answer has a suggestion for how to do that.
2. You could keep the labels as integers `0` and `1` and use `tf.nn.sparse_softmax_cross_entropy_with_logits()`, as suggested in this answer.
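As a sanity check that the two approaches agree, here is a NumPy sketch (mine, not the original answer's code): the dense form dots the one-hot vector with the log-softmax, while the sparse form just picks out the log-probability at the integer label, which is what the sparse variant of the op does conceptually:

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# Approach 1: one-hot labels with NLABELS = 2.
logits = np.array([[0.3, -1.2], [2.0, 0.1]])
onehot = np.array([[1.0, 0.0], [0.0, 1.0]])   # label 0 -> [1 0], label 1 -> [0 1]
loss_dense = -np.sum(onehot * log_softmax(logits), axis=-1)

# Approach 2: integer labels, indexing the same log-probabilities.
labels = np.array([0, 1])
loss_sparse = -log_softmax(logits)[np.arange(len(labels)), labels]

print(np.allclose(loss_dense, loss_sparse))  # -> True
```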
-------------------------------------------------------------------------------------------------------------------------- It appears your ... If you have more labels, have a look at ...

--------------------------------------------------------------------------------------------------------------------------

There are a few Stack Overflow questions about computing one-hot embeddings with TensorFlow, and the accepted solution builds the dense one-hot matrix with `tf.sparse_to_dense()`.
This is almost identical to the code in an official tutorial: https://www.tensorflow.org/versions/0.6.0/tutorials/mnist/tf/index.html To me it seems that since ...
Do you expect this implementation to be faster? And is it flawed for any other reason?

--------------------------------------------------------------------------------------------------------------------------

In the version suggested in the tutorial, the largest tensor will be the result of `tf.sparse_to_dense()`, which will be `32 x 1000000`. In the `one_hot()` function in the question, the largest tensor will be the result of `np.identity(1000000)`, which is 4 terabytes. Of course, allocating this tensor probably won't succeed. Even if the number of classes were much smaller, it would still waste memory to store all of those zeroes explicitly: TensorFlow does not automatically convert your data to a sparse representation, even though it might be profitable to do so.
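To illustrate the difference in scale (a NumPy sketch of my own, reusing the batch size of 32 and 1,000,000 classes from the answer above, and assuming 4-byte floats): building only the `batch x num_classes` rows you actually need stays comfortably in memory, while the identity-matrix approach needs `num_classes**2` entries:

```python
import numpy as np

def one_hot_rows(labels, num_classes):
    # Build only the (batch, num_classes) rows we need,
    # instead of materializing the full identity matrix.
    out = np.zeros((len(labels), num_classes), dtype=np.float32)
    out[np.arange(len(labels)), labels] = 1.0
    return out

batch, num_classes = 32, 1_000_000
rows = one_hot_rows(np.arange(batch), num_classes)
print(rows.nbytes / 1e6)                 # 128.0 MB: large but allocatable

# What np.identity(1000000) would need, assuming 4-byte floats:
identity_bytes = num_classes ** 2 * 4
print(identity_bytes / 1e12)             # -> 4.0 terabytes
```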
Finally, I want to offer a plug for a new function that was recently added to the open-source repository, and will be available in the next release.