This section introduces the bare minimum of how a Naive Bayesian Classifier works.
Instance - A single instance of our data which we have to classify. In this project, instances are 28x28 black and white images. Each image is broken up into features.
Label - Every image properly belongs to a single label. In this project, labels are the digits 0 through 9. We use the variable Y for labels.
Feature - A feature is an observable property about an instance. In this project, features are the values of individual pixels.
Training Data - Also called the corpus, a set of data that has been correctly labeled, used to estimate our probabilities.
Validation Data - A separate set of data used to test our classifier, comparing the guessed labels with the known labels. The classifier does not see the labels for the validation data.
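As a concrete picture of that last step, validation can be thought of as running the classifier on every validation instance and comparing its guess against the withheld label. The sketch below assumes a hypothetical classify(features) function and that each instance is stored as a (features, label) pair; neither detail is prescribed here.

```python
# Minimal validation sketch (assumes a hypothetical classify() and
# validation_data as a list of (features, label) pairs).
def accuracy(classify, validation_data):
    correct = sum(1 for features, label in validation_data
                  if classify(features) == label)
    return correct / len(validation_data)
```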
The goal of a classifier is to compute the most likely label for a given image. We break the image apart into features F0 F1 ... Fk. We then want to rank each digit Y by how closely it matches the image. The rank is a value indicating how likely it is that an image with features F0 F1 ... Fk should be labeled Y. We take the label with the highest rank to be the most likely label. Since the ranks are based on probabilities, we need some additional vocabulary.
Unconditional Probability - The probability of Y, written P(Y), is the probability that some arbitrary image should be labeled Y.
Conditional Probability - The probability of F given Y, written P(F | Y), is the probability that feature F occurs in an image assuming that the image is correctly labeled Y.
We use this notation to define the rank of a digit Y for an image with features F0 F1 ... Fk, written R(Y | F0 F1 ... Fk). Ranks are very small fractions that are proportional to, and easier to compute than, the probability of Y being the correct label. A rank is defined as the product of P(Y), the probability that an image should be labeled Y if we ignore its features, and the probabilities P(Fi | Y) that each feature in the image occurs when Y is the correct label.
R(Y | F0 F1 ... Fk) = P(Y) * P(F0 | Y) * P(F1 | Y) * . . . * P(Fk | Y)
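In code, a rank is just a running product over the features, and classification picks the label with the highest rank. The sketch below assumes the probabilities have already been estimated and stored in dictionaries named prior and likelihood (illustrative names only); estimating them is the subject of the rest of this section.

```python
# Sketch of the rank and the decision rule, assuming prior[y] holds P(Y = y)
# and likelihood[(f, y)] holds P(F = f | Y = y).
def rank(y, features, prior, likelihood):
    r = prior[y]
    for f in features:
        # A (feature, label) pair never seen during estimation gets probability 0.
        r *= likelihood.get((f, y), 0.0)
    return r

def classify(features, labels, prior, likelihood):
    # The guessed label is the one with the highest rank.
    return max(labels, key=lambda y: rank(y, features, prior, likelihood))
```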
However, since we do not know the probabilities P(Y) or P(F | Y), we need to estimate them by examining our training data. This requires one last piece of notation.
Let n be the number of images in the corpus.
Let c(Y) be the number of images labeled Y in the corpus.
Let c(F, Y) be the number of images both labeled Y and having feature F.
We estimate these probabilities by counting images in the corpus. To estimate P(Y), the probability that an arbitrary unclassified image should be labeled Y, we use the fraction of images in our corpus labeled Y. Similarly, to estimate P(F | Y), the probability that an image labeled Y has feature F, we compute the fraction of the images labeled Y in the corpus that have feature F.
P(Y) ≈ c(Y) / n
P(Fi | Y) ≈ c(Fi, Y) / c(Y)
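These counts translate directly into code. The following sketch assumes, as above, that the corpus is a list of (features, label) pairs; the names and data layout are illustrative, not requirements.

```python
from collections import Counter

def estimate(corpus):
    n = len(corpus)            # number of images in the corpus
    c_y = Counter()            # c(Y): images labeled Y
    c_fy = Counter()           # c(F, Y): images labeled Y that have feature F
    for features, y in corpus:
        c_y[y] += 1
        for f in features:
            c_fy[(f, y)] += 1
    prior = {y: c_y[y] / n for y in c_y}                              # P(Y) ≈ c(Y) / n
    likelihood = {(f, y): c / c_y[y] for (f, y), c in c_fy.items()}   # P(F | Y) ≈ c(F, Y) / c(Y)
    return prior, likelihood
```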
As a simplified example, consider 2x2 images of the digits 1 and 2. Each image has four pixels, and since each pixel can be either black or white, there are eight possible features; any single image exhibits four of them. Number the pixels as follows:
Since there are two digits, we have two labels. Assume our corpus consists of the following images:
Images Labeled 1
Images Labeled 2
Now let us try to classify the following unlabeled image.
This image can be broken into the features F0 F1 F2 F3, where F0 is "pixel 0 is white", F1 is "pixel 1 is black", F2 is "pixel 2 is white", and F3 is "pixel 3 is white". To classify this image, we calculate the rank for each possible label, 1 and 2, and take the higher.
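Concretely, the features can be read straight off the pixels. One possible encoding (an illustration, not a required format) represents an image as a tuple of pixel colors and a feature as a (pixel index, color) pair:

```python
def extract_features(pixels):
    # One feature per pixel: which color that pixel has in this image.
    return [(i, color) for i, color in enumerate(pixels)]

unlabeled = ("white", "black", "white", "white")   # the image we are classifying
features = extract_features(unlabeled)
# -> [(0, 'white'), (1, 'black'), (2, 'white'), (3, 'white')], i.e. F0, F1, F2, F3
```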
The rank for 1 is R(1 | F0, F1, F2, F3) = P(1) * P(F0 | 1) * P(F1 | 1) * P(F2 | 1) * P(F3 | 1). The probability that an arbitrary image should be labeled 1 can be estimated as P(1) ≈ c(1) / n. Since there are 3 images labeled 1, and 5 images total, this is ⅗. Next, we must estimate P(F0 | 1), the probability of pixel 0 being white, assuming the image should be labeled 1. As one of the three images labeled 1 has pixel 0 white, we get c(F0, 1) / c(1) = ⅓. Similarly, we estimate P(F1 | 1) ≈ c(F1, 1) / c(1) = ⅔, while P(F2 | 1) and P(F3 | 1) are both approximated at ⅓. Thus the rank for 1 is ⅗ * ⅓ * ⅔ * ⅓ * ⅓, or about 0.0148.
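This arithmetic is easy to check directly; using exact fractions:

```python
from fractions import Fraction

p_1  = Fraction(3, 5)   # P(1)      ≈ c(1) / n        = 3/5
p_f0 = Fraction(1, 3)   # P(F0 | 1) ≈ c(F0, 1) / c(1) = 1/3
p_f1 = Fraction(2, 3)   # P(F1 | 1) ≈ c(F1, 1) / c(1) = 2/3
p_f2 = Fraction(1, 3)   # P(F2 | 1)                   = 1/3
p_f3 = Fraction(1, 3)   # P(F3 | 1)                   = 1/3
rank_1 = p_1 * p_f0 * p_f1 * p_f2 * p_f3
print(rank_1, float(rank_1))   # 2/135, approximately 0.0148
```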
The rank for 2 is R(2 | F0, F1, F2, F3) = P(2) * P(F0 | 2) * P(F1 | 2) * P(F2 | 2) * P(F3 | 2). The estimate of P(2) is ⅖. However, when we try to compute the estimate P(F0 | 2) ≈ c(F0, 2) / c(2), something interesting happens. F0 is the feature "pixel 0 is white", and there are no images both labeled 2 and having pixel 0 white. Thus the numerator, c(F0, 2), is 0, we estimate P(F0 | 2) ≈ 0, and the rank of 2 for this image is
⅖ * 0 * P(F1 | 2) * P(F2 | 2) * P(F3 | 2) = 0.
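The same check for 2 shows why no further work is needed: once any factor of the product is zero, the remaining factors cannot change the result.

```python
p_2  = 2 / 5    # P(2)      ≈ c(2) / n        = 2/5
p_f0 = 0 / 2    # P(F0 | 2) ≈ c(F0, 2) / c(2) = 0/2 = 0
rank_2 = p_2 * p_f0   # the factors P(F1 | 2), P(F2 | 2), P(F3 | 2) are irrelevant here
print(rank_2)         # 0.0
```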
Since 1 has the higher rank, we classify this image as a picture of 1.