We also add some extra evidence called a bias. Basically, we want to be able to say that some things are more likely independent of the input.
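The role of the bias can be seen in a tiny sketch. The weights and input below are toy values (not the model's trained parameters); the point is only that the bias contributes evidence even when the input contributes none:

```python
import numpy as np

W = np.array([[1.0, -1.0],
              [0.5,  2.0]])   # toy weights: 2 classes x 2 input features
b = np.array([3.0, -3.0])     # toy bias: per-class evidence added unconditionally
x = np.zeros(2)               # an input carrying no evidence at all...
evidence = W @ x + b          # ...still yields evidence: exactly the bias
```

So even before seeing any input, the model can prefer some classes over others.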

But it's often more helpful to think of softmax the first way: exponentiating its inputs and then normalizing them. The exponentiation means that one more unit of evidence increases the weight given to any hypothesis multiplicatively. And conversely, having one less unit of evidence means that a hypothesis gets a fraction of its earlier weight.
Softmax then normalizes these weights, so that they add up to one, forming a valid probability distribution.
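That two-step view of softmax can be written directly as a short sketch (plain NumPy, with made-up evidence values):

```python
import numpy as np

def softmax(evidence):
    exps = np.exp(evidence)      # exponentiate: one more unit of evidence multiplies the weight by e
    return exps / exps.sum()     # normalize: the weights now sum to one

probs = softmax(np.array([2.0, 1.0, 0.1]))
```

The outputs are a valid probability distribution, and the hypothesis with the most evidence gets the largest share.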

x = tf.placeholder("float", [None, 784])
x isn't a specific value; it's a placeholder, a value that we'll input when we ask TensorFlow to run a computation. (Here None means that a dimension can be of any length.)

A Variable is a modifiable tensor that lives in TensorFlow's graph of interacting operations. It can be used and even modified by the computation. For machine learning applications, one generally has the model parameters be Variables.
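For a softmax model over 784-pixel images and 10 classes, the parameters might be declared like this (initialized to zeros here purely for simplicity; the shapes are what matter):

```python
import tensorflow as tf

# Model parameters as Variables: a weight matrix and a bias vector.
W = tf.Variable(tf.zeros([784, 10]))  # maps a 784-dim image to 10 classes of evidence
b = tf.Variable(tf.zeros([10]))       # per-class bias evidence
```

Because they are Variables, the training operations added later can modify them.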

In order to train our model, we need to define what it means for the model to be good. Well, actually, in machine learning we typically define what it means for a model to be bad, called the cost or loss, and then try to minimize how bad it is. But the two are equivalent.

Where y is our predicted probability distribution, and y′ is the true distribution (the one-hot vector we'll input). In some rough sense, the cross-entropy is measuring how inefficient our predictions are for describing the truth. Going into more detail about cross-entropy is beyond the scope of this tutorial, but it's well worth understanding.
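A quick numeric sketch of the cross-entropy −Σᵢ y′ᵢ log(yᵢ), with toy distributions rather than the tutorial's data, shows how it rewards putting probability mass on the true class:

```python
import numpy as np

def cross_entropy(y, y_true):
    # y: predicted probability distribution; y_true: one-hot truth
    return -np.sum(y_true * np.log(y))

y_true = np.array([0.0, 1.0, 0.0])                    # the true class is index 1
good = cross_entropy(np.array([0.1, 0.8, 0.1]), y_true)  # confident and right: low cost
bad  = cross_entropy(np.array([0.8, 0.1, 0.1]), y_true)  # confident and wrong: high cost
```

The "inefficient" prediction pays a much larger cost than the accurate one.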

What TensorFlow actually does here, behind the scenes, is it adds new operations to your graph which implement backpropagation and gradient descent. Then it gives you back a single operation which, when run, will do a step of gradient descent training, slightly tweaking your variables to reduce the cost.
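What such a training step does can be sketched by hand in plain NumPy (shapes and data below are illustrative, not the tutorial's). For softmax with cross-entropy, the gradient of the loss with respect to the logits is simply (y − y′), so one step of gradient descent nudges the parameters downhill:

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return exps / exps.sum(axis=1, keepdims=True)

def loss(W, b, x, y_true):
    return -np.sum(y_true * np.log(softmax(x @ W + b)))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                # 5 toy examples, 4 features
y_true = np.eye(3)[rng.integers(0, 3, 5)]  # one-hot labels over 3 classes
W, b = np.zeros((4, 3)), np.zeros(3)

# One gradient descent step with a small learning rate.
grad_logits = softmax(x @ W + b) - y_true  # dLoss/dlogits for softmax + cross-entropy
W_new = W - 0.01 * x.T @ grad_logits       # dLoss/dW = x^T (y - y')
b_new = b - 0.01 * grad_logits.sum(axis=0) # dLoss/db sums the logit gradients
```

Each such step slightly reduces the cost, which is exactly what running the returned training operation does, just with TensorFlow computing the gradients for you.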