
TensorFlow cost function considerations

"sigmoid can be used with cross-entropy. and softmax can be used with log-likelihood cost"

Let's say we have a network for N-way classification, which has N outputs y[0], ..., y[N-1] over which a softmax function is applied. We'll also represent the labels as one-hot vectors with components t[0], ..., t[N-1]. For each example, only one of these is 1 and the others are 0. Then the cross-entropy is defined as:

CE(y, t) = -sum(t[i] * log(y[i]), i = 0 .. N-1)

This can also be interpreted as a negative log likelihood or a KL divergence between the output distribution and the target distribution. All of these amount to the same thing.
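
As a concrete illustration (made-up numbers, plain NumPy rather than TensorFlow), only the true class's log-probability contributes to the sum:

import numpy as np

# Softmax output over N = 3 classes and the corresponding one-hot target.
y = np.array([0.7, 0.2, 0.1])
t = np.array([0.0, 1.0, 0.0])   # the true class is class 1

ce = -np.sum(t * np.log(y))
print(ce)                        # -log(0.2) ~= 1.609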

Of course the binary cross-entropy is just a special case of the categorical cross-entropy when the number of classes is 2. The reason for the confusion, I think, is that in the case of a binary classification problem, you usually build a network with a single sigmoid output, where 0 means one class and 1 means the other class. You can of course do this because p(classA) = 1 - p(classB), so it's pointless to have the network compute both of them. So the binary cross-entropy is usually defined in terms of a single output y:

CE(y, t) = - t * log(y) - (1 - t) * log(1 - y)

The problems start when you apply this definition of the binary cross-entropy elementwise to each output of an N-way classifier where N > 2. That just... doesn't make sense. It definitely isn't equivalent to the proper way to compute the cross-entropy in that case.

(Note that for N=2, it actually gives you two times the cross-entropy, so everything would still work correctly in that case apart from the updates being twice as large as they should be.)
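
To make that factor of two concrete, here is a small check with made-up numbers (NumPy, N = 2, softmax outputs summing to 1):

import numpy as np

# 2-way softmax output and its one-hot target for a single example.
y = np.array([0.7, 0.3])
t = np.array([1.0, 0.0])

categorical_ce = -np.sum(t * np.log(y))
# Binary cross-entropy applied elementwise to each output, then summed:
binary_ce_sum = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

print(categorical_ce)   # ~0.357
print(binary_ce_sum)    # ~0.713, i.e. exactly twice the categorical value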

I accidentally did this once (in Theano it's a matter of using T.nnet.binary_crossentropy instead of T.nnet.categorical_crossentropy, an easy mistake to make) and it's a pretty tough bug to track down :)


It's my understanding that the 'logits' should be a Tensor of probabilities, each one corresponding to a certain pixel's probability that it is part of an image that will ultimately be a "dog" or a "truck" or whatever... a finite number of things.

-----------------------------------------------------------------------------------------------------------------------

These logits will get plugged into the cross-entropy equation (the formula from Wikipedia):

H(p, q) = -sum(p(x) * log(q(x)), over all x)

As I understand it, the logits are plugged into the right side of the equation. That is, they are the q of every x (image). If they were probabilities from 0 to 1... that would make sense to me. But when I'm running my code and ending up with a tensor of logits, I'm not getting probabilities. Instead I'm getting floats that are both positive and negative:

-0.07264724 -0.15262917  0.06612295 ..., -0.03235611  0.08587133  0.01897052
 0.04655019 -0.20552202  0.08725972 ..., -0.02107313 -0.00567073  0.03241089
 0.06872301 -0.20756687  0.01094618 ...,
etc.

So my question is... is that right? Do I have to somehow calculate all my logits and turn them into probabilities from 0 to 1?

>> The crucial thing to note is that tf.nn.softmax_cross_entropy_with_logits(logits, labels) performs an internal softmax on each row of logits so that they are interpretable as probabilities before they are fed to the cross entropy equation.

Therefore, the "logits" need not be probabilities (or even true log probabilities, as the name would suggest), because of the internal normalization that happens within that op.

An alternative way to write:

xent = tf.nn.softmax_cross_entropy_with_logits(logits, labels)

...would be:

softmax = tf.nn.softmax(logits)
xent = -tf.reduce_sum(labels * tf.log(softmax), 1)

However, this alternative would be (i) less numerically stable (since the softmax may compute much larger values) and (ii) less efficient (since some redundant computation would happen in the backprop). For real uses, we recommend that you use tf.nn.softmax_cross_entropy_with_logits().
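
A small end-to-end check of that equivalence, using made-up logits and labels and the TF 0.x-era positional argument order used elsewhere on this page:

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.3]])
labels = tf.constant([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])

# Fused, numerically stable op:
xent_fused = tf.nn.softmax_cross_entropy_with_logits(logits, labels)

# Manual equivalent (less stable, less efficient):
softmax = tf.nn.softmax(logits)
xent_manual = -tf.reduce_sum(labels * tf.log(softmax), 1)

with tf.Session() as sess:
    print(sess.run(xent_fused))    # per-example cross-entropies, roughly [0.417, 0.220]
    print(sess.run(xent_manual))   # the same values, up to floating-point error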

--------------------------------------------------------------------------------------------------------------------------


Q: I am trying to adapt this MNIST example to binary classification.

But when I change NLABELS from NLABELS=2 to NLABELS=1, the loss function always returns 0 (and accuracy 1).

A: The original MNIST example uses a one-hot encoding to represent the labels in the data: this means that if there are NLABELS = 10 classes (as in MNIST), the target output is [1 0 0 0 0 0 0 0 0 0] for class 0, [0 1 0 0 0 0 0 0 0 0] for class 1, etc. The tf.nn.softmax() operator converts the logits computed by tf.matmul(x, W) + b into a probability distribution across the different output classes, which is then compared to the fed-in value for y_.

If NLABELS = 1, this acts as if there were only a single class, and the tf.nn.softmax() op would compute a probability of 1.0 for that class, leading to a cross-entropy of 0.0, since tf.log(1.0) is 0.0 for all of the examples.
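
You can see this degenerate behaviour directly with NumPy: a softmax over a single logit z always normalizes to exp(z)/exp(z) = 1.0, so the loss is identically zero (made-up logit values below):

import numpy as np

z = np.array([[3.2], [-1.7], [0.0]])             # NLABELS = 1: one logit per example
softmax = np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)
print(softmax)                                   # [[1.], [1.], [1.]]
print(-np.sum(1.0 * np.log(softmax), axis=1))    # [0., 0., 0.] -> loss always 0, accuracy always 1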

There are (at least) two approaches you could try for binary classification:

  1. The simplest would be to set NLABELS = 2 for the two possible classes, and encode your training data as [1 0] for label 0 and [0 1] for label 1. This answer has a suggestion for how to do that.

  2. You could keep the labels as integers 0 and 1 and use tf.nn.sparse_softmax_cross_entropy_with_logits(), as suggested in this answer (a sketch of this approach follows below).
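
A minimal sketch of the second approach, under assumed MNIST-style shapes (784-dimensional inputs) and the TF 0.x-era positional argument order used elsewhere on this page:

import tensorflow as tf

NLABELS = 2
x  = tf.placeholder(tf.float32, [None, 784])   # assumed flattened MNIST-style images
y_ = tf.placeholder(tf.int64, [None])          # integer class labels (0 or 1), not one-hot

W = tf.Variable(tf.zeros([784, NLABELS]))
b = tf.Variable(tf.zeros([NLABELS]))
logits = tf.matmul(x, W) + b                   # two logits per example, not one

# The sparse op takes the integer labels directly; no one-hot encoding is built by hand.
cross_entropy = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)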


--------------------------------------------------------------------------------------------------------------------------

It appears your y_train contains the raw label values themselves, whereas the model expects label probabilities. In your case, since there are only two labels, you can convert them to label probabilities as follows:

y_train = tf.concat(1, [1 - y_train, y_train])

If you have more labels, have a look at tf.sparse_to_dense to convert them to a one-hot encoding.
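
As an illustration of that one-liner with made-up values (TF 0.x-era tf.concat argument order, as used above):

y_train = tf.constant([[0.], [1.], [1.]])        # raw binary labels, one per row
y_pairs = tf.concat(1, [1 - y_train, y_train])   # columns: [P(class 0), P(class 1)]
# y_pairs evaluates to [[1., 0.], [0., 1.], [0., 1.]]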


There are a few Stack Overflow questions about computing one-hot embeddings with TensorFlow, and here is the accepted solution:

num_labels = 10
# Turn the integer labels into a column vector, one label per row.
sparse_labels = tf.reshape(label_batch, [-1, 1])
# Batch size, derived from the input at graph-execution time.
derived_size = tf.shape(label_batch)[0]
# Row indices 0..batch_size-1, also as a column vector.
indices = tf.reshape(tf.range(0, derived_size, 1), [-1, 1])
# Pair each row index with its label: [[0, label_0], [1, label_1], ...].
concated = tf.concat(1, [indices, sparse_labels])
# The dense output shape, [batch_size, num_labels].
outshape = tf.reshape(tf.concat(0, [derived_size, [num_labels]]), [-1])
# Scatter 1.0 at each (row, label) position and 0.0 everywhere else.
labels = tf.sparse_to_dense(concated, outshape, 1.0, 0.0)

This is almost identical to the code in an official tutorial: https://www.tensorflow.org/versions/0.6.0/tutorials/mnist/tf/index.html

To me it seems that since tf.nn.embedding_lookup exists, it's probably more efficient. Here's a version that uses this, and it supports arbitrarily-shaped inputs:

import numpy as np
import tensorflow as tf

def one_hot(inputs, num_classes):
    with tf.device('/cpu:0'):
        # Identity matrix whose rows are the one-hot vectors for each class.
        table = tf.constant(np.identity(num_classes, dtype=np.float32))
        embeddings = tf.nn.embedding_lookup(table, inputs)
    return embeddings
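
A hypothetical call, for context (the label values and num_classes here are made up):

labels = tf.constant([2, 0, 1])            # integer class ids
onehot = one_hot(labels, num_classes=3)    # float32 tensor of shape [3, 3]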

Do you expect this implementation to be faster? And is it flawed for any other reason?


--------------------------------------------------------------------------------------------------------------------------

The one_hot() function in your question looks correct. However, the reason that we do not recommend writing code this way is that it is very memory inefficient. To understand why, let's say you have a batch size of 32, and 1,000,000 classes.

  • In the version suggested in the tutorial, the largest tensor will be the result of tf.sparse_to_dense(), which will be 32 x 1000000.

  • In the one_hot() function in the question, the largest tensor will be the result of np.identity(1000000), which is 4 terabytes (10^6 × 10^6 float32 entries at 4 bytes each). Of course, allocating this tensor probably won't succeed. Even if the number of classes were much smaller, it would still waste memory to store all of those zeroes explicitly: TensorFlow does not automatically convert your data to a sparse representation, even though it might be profitable to do so.

Finally, I want to offer a plug for a new function that was recently added to the open-source repository, and will be available in the next release. tf.nn.sparse_softmax_cross_entropy_with_logits() allows you to specify a vector of integers as the labels, and saves you from having to build the dense one-hot representation. It should be much more efficient than either solution for large numbers of classes.
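
For comparison, a sketch of what that looks like (assumed names and shapes): the labels stay as a [batch_size] vector of integer class ids, and no [batch_size, num_classes] one-hot tensor is ever materialized:

import tensorflow as tf

num_classes = 1000000
logits = tf.placeholder(tf.float32, [None, num_classes])
label_ids = tf.placeholder(tf.int64, [None])     # one integer per example, no one-hot

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits, label_ids))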
