In this project you will implement a naive Bayesian classifier for a set of scanned handwritten digit images. Your task is to guess which digit a handwritten image is supposed to be. To help, you are given a set, or corpus, of training images which have already been labeled with the correct digits. The goal of a classifier is to compute the most likely digit for an unclassified image. To do so, we break the image apart into features and rank each digit by how closely it matches the image. The details of how Naive Bayesian Classifiers work are covered in Naive Bayesian Classifiers. For this project, there are some specifics to consider.
Instances are 28x28 black and white images.
Labels are the digits 0 through 9.
Since images are 28x28, there are 784 different pixels.
There are two possible features for each pixel, one for each color. Thus theoretically there are 784*2=1568 features.
However, since the image is black and white, our code can get away with storing just 784 features, one for each pixel being black. If the feature is absent, the corresponding value in the code is False, and we can infer that the pixel must be white.
Even with simple features, your classifier will be able to guess 70-80% of the digits correctly, a steep improvement over the 10% strategy of guessing at random. Our data set consists of 5000 training images and 1000 validation images. Your classifier will use the training images to estimate the most likely digit for each validation image. This project will primarily use Haskell's list comprehensions, and some built-in functions that compute data over lists: most notably product, maximum, length, nub, and sum. Most data will be stored in association lists.
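To give a feel for the style, here is a small illustrative sketch (the names are invented for this example and are not part of the project) of an association list and the kind of list-comprehension computations the built-ins above support:

```haskell
-- An association list pairing keys with values.
fruitCounts :: [(String, Int)]
fruitCounts = [("apple", 3), ("pear", 5), ("plum", 2)]

-- Pull out every count with a list comprehension.
allCounts :: [Int]
allCounts = [c | (_, c) <- fruitCounts]

-- Fold the list with built-ins.
countProduct, countMax, countSum :: Int
countProduct = product allCounts   -- 3*5*2 = 30
countMax     = maximum allCounts   -- 5
countSum     = sum allCounts       -- 10
```

Nearly every function in this project follows this shape: a comprehension to select the relevant values, then a single built-in to collapse them.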
The simple implementation of this project will take about 20 minutes to classify all 1000 images. A more efficient implementation can complete the task in about one minute. The simple implementation will generally be worth a B grade. To obtain an A, you will have to precompute and store information; see Smoothing Estimated Probabilities and Avoiding Unnecessary Work below.
Github Classroom Repository Link
Monday 8/30 - Project Released.
Friday 9/3 - Milestone One deadline.
Tuesday 9/7 - Milestone Two (buildCorpus) deadline.
Wednesday 9/15 - Project submission deadline.
Friday 9/17 - Late submission deadline.
Be sure to push your changes to the repository for each deadline.
Late submissions are automatically accepted, but will be assessed a 5% (one-third letter grade) penalty.
You do not need to look at most of the files in the project. The ones that may be of interest are:
DigitRecognition.hs - This is the file you will edit (see Naive Bayesian Classification). The values and functions are listed in roughly the suggested implementation order. Problem details are given in the file, and this document primarily discusses the underlying theory.
TestInputs.hs - This file contains sample inputs you can use in ghci to test your functions. Many of these are from the grading tests, some are from the in-class examples.
README.md - This file contains a brief project description, as well as spots for your name and Trinity NetID (e.g. sfogarty). Be certain to add your name and ID.
Makefile - Used to build and test the project. Be certain make runs before you submit! It supports the following commands, which are run from the command line (not from inside ghci).
make - builds the project. The executable is output to classifier.
make clean - removes all generated files.
digitdata/ - Contains the images and labels for our data set. You do not need to pass them into classifier, they are hard-coded as defaults.
singleimage - A single image, for testing purposes.
trainingimages/traininglabels - 5000 images and labels used to estimate probabilities.
validationimages/validationlabels - 1000 images and labels used to check your classifier.
testimages/testlabels - An alternate set of 1000 training images and labels, used by the --quick flag when testing your program.
The project can be built using make. If the executable does not build, use make setup to install any missing Haskell packages. The executable and compiled source files can be removed using make clean. The project compiles to classifier. To run grading tests on the project, use:
./classifier --test
To actually run the project on the images, I suggest ./classifier -q -v 2 -c 20. The following flags are supported:
-h or --help Prints a help message and exits.
--test Runs unit tests on your code. You must complete all milestone tests before the remaining tests will run.
-v <n> Verbose output: print results incrementally as they are computed.
-q or --quick Speeds computation by using a smaller training set.
-c <n> or --count <n> Only test the first <n> images.
--timeout <n> Time out computation after <n> minutes.
--ranking If you've implemented rankImage, outputs the ranks of each digit instead of simply the best digit.
For the milestones, you will need to implement the first five values in DigitRecognition.hs.
List comprehensions require a list of inputs. Conveniently, your first two problems are to define the lists allFeatures and allDigits. Do not overthink these problems.
The showPixelImage function converts an image into a string. Please read the comments thoroughly; doing so will make your life easier.
As mentioned, most data is stored in association lists. The lookupVal function looks up a value from an association list.
The helper function buildCorpus reorganizes the inputs read from the file into a more useful format.
Your project should pass all milestone tests.
For the core project you should implement the full naive Bayesian classifier. The code follows a similar structure to the example in Naive Bayesian Classifiers.
probOfDigit estimates the unconditional probability P(Y).
probOfFeature and probOfNoFeature compute estimates for the conditional probabilities P(Fi | Y) for black and white pixels, respectively. Recall that the presence of white pixels is inferred by the absence of the corresponding black pixel.
rankOfDigit computes the rank R(Y | F0 F1 ... Fk) for a given digit.
classifyImage is the naive Bayesian classifier.
rankImage is optional, and returns an association list of digits and their rank. Look at valueOfRank. This is helpful for debugging, but not worth credit.
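Putting these pieces together, a digit's rank is the prior estimate times the product of the per-pixel conditional estimates. A minimal numeric sketch (the probabilities and names below are invented for illustration; your real functions will derive these values from the corpus):

```haskell
-- Hypothetical per-pixel estimates P(Fi | Y) for one digit (made-up data).
pixelProbs :: [Double]
pixelProbs = [0.9, 0.8, 0.95, 0.7]

-- Hypothetical prior P(Y) for that digit.
prior :: Double
prior = 0.1

-- Rank R(Y | F0 .. Fk) = P(Y) * product of the P(Fi | Y) estimates.
rank :: Double
rank = prior * product pixelProbs
```

The classifier then simply picks the digit whose rank is largest.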
As thousands of 28x28 images are difficult to debug in Haskell, you may want to test your functions on a much smaller set of 2x2 images. The following code contains the images from the example, as well as the desired output of buildCorpus. Be careful, as the list of features for 2x2 images is quite different from those for 28x28 images.
img1a = [[False,True],[False,True]]
img1b = [[True,False],[True,False]]
img1c = [[True,True],[True,True]]
img2a = [[True,True],[False,True]]
img2b = [[True,False],[False,True]]
images = [(img1a, 1), (img1b, 1), (img1c, 1), (img2a, 2), (img2b, 2)]
corpusOutput = [(1, [img1a, img1b, img1c]), (2, [img2a, img2b])]
-- requires: import Data.List (sort)
sortCorpus corpus = sort [(digit, sort imgs) | (digit, imgs) <- corpus]
-- this will be True if your buildCorpus is working correctly for this example.
yayBuildCorpus = (sortCorpus (buildCorpus images)) == (sortCorpus corpusOutput)
newImg = [[False,True],[False,False]]
Two of the functions in the milestone, lookupVal and buildCorpus, have additional requirements beyond producing the correct output. For full credit, lookupVal must check that there is exactly one value associated with the key. If there is not, you should produce reasonable error messages. For buildCorpus, you should avoid producing empty associations: if there are no images associated with a digit, there should not be a tuple for that digit in the output. Be careful: your implementation of buildCorpus should still be quick (linear in the size of the input).
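For the error messages, Haskell's error function works well. A small illustrative pattern (safeHead is a made-up helper, not part of the project):

```haskell
-- A hypothetical helper that rejects bad input with a descriptive message.
safeHead :: [a] -> a
safeHead []    = error "safeHead: empty list"
safeHead (x:_) = x
```

A good error message names the function and the problem, which makes debugging in ghci far less painful.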
You will notice that the program is slightly inaccurate and very slow. To get an A on this project, you will need to improve the accuracy and performance. I highly suggest you first get the project to work, pass all tests, and ensure you have an ~70% success rate. Then work on improving accuracy by smoothing estimated probabilities, achieving an ~80% success rate. Then work on improving performance.
The accuracy can be improved by smoothing the estimated probabilities. Recall that for a feature Fi of a specific pixel being present or absent, we estimate the probability of that feature for a specific digit as P(Fi | Y) ≈ c(Fi, Y) / c(Y). We count c(Fi, Y), the number of images with feature Fi labeled Y, and divide by c(Y), the total number of images labeled with that digit. However, it is possible that none of the training images for that digit have that feature, in which case the numerator, and thus the estimated probability, is zero. Since you are multiplying probabilities together, the entire rank will be 0. Thus, one stray pixel will result in a rank of 0, regardless of how many other pixels are a good match.
To fix this, smooth the probabilities so they are never zero. The simplest solution (worth 1 point) is to change the numerator to 1 + c(Fi, Y). This ensures that the estimated probability won't be zero. However, it adds the possibility of the result being greater than 1. The best smoothing will obey the following properties, in order of importance:
The estimate is never 0.
The estimate is never greater than 1.
The sum of the estimate of a pixel being black and the pixel being white is 1.
As you go, it is important that you do not distort the probability estimation. Ensure the following two properties remain true:
There is no possibility of collisions: two features with different counts ending up with the same estimate.
The smoothing is, in fact, smooth: any two features with the same difference in counts have the same difference in estimated probability.
Both the original estimate and the simplest smoothing described above preserve these properties. The best way to avoid breaking them is to keep your smoothing simple and consistent for all possible numerators.
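To make the simplest smoothing concrete, here is a numeric sketch (the counts are invented, and rawProb/smoothedProb are illustrative names, not project functions):

```haskell
-- Suppose 0 of the 40 training images for a digit have some feature.
cFY, cY :: Int
cFY = 0
cY  = 40

-- Unsmoothed estimate: 0/40 = 0, which zeroes out the whole rank.
rawProb :: Double
rawProb = fromIntegral cFY / fromIntegral cY

-- Simplest smoothing: add 1 to the numerator, giving 1/40 = 0.025.
smoothedProb :: Double
smoothedProb = fromIntegral (1 + cFY) / fromIntegral cY
```

Note that when c(Fi, Y) equals c(Y), this numerator-only scheme produces an estimate greater than 1, which is exactly the flaw the better smoothing schemes must also address.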
The program is slow because many values are recomputed unnecessarily. Fixing this will require you to modify the program extensively on your own.
Every time you classify a new image, you are recomputing c(F, Y) for each digit and feature. However, these numbers do not change from image to image. To improve the efficiency of your program, you should precompute and save summaries of the images for each digit. First, gather the images by their label: this is what buildCorpus already does. Then, for each feature, compute and store a pair of Ints: what you store is up to you, but it should be enough to compute probOfFeature without looking at the list of images. Thus a Summary is a 28x28 nested list of tuples.
type Summary = [[(Int, Int)]]
Instead of the Corpus associating each digit with a list of PixelImages, it will now associate each digit with a Summary. You will also need to store how many images there originally were for each digit. The following type aliases reflect these changes.
type DigitSummary = [(Digit, Summary)]
type DigitCount = [(Digit, Int)]
type Corpus = (DigitCount, DigitSummary)
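One possible choice of pair, an assumption on my part since the project deliberately leaves the contents up to you, is to store for each pixel how many images have it black and how many have it white. A sketch over a plain list of images, using hypothetical names:

```haskell
-- Assume images are nested lists of Bool, True = black pixel.
type Img = [[Bool]]

-- For one pixel position, count (black, white) across a list of images.
countPixel :: [Img] -> Int -> Int -> (Int, Int)
countPixel imgs r c =
    let blacks = length [() | img <- imgs, (img !! r) !! c]
    in  (blacks, length imgs - blacks)
```

With such counts stored in the Summary, probOfFeature reduces to a division of two stored Ints rather than a traversal of thousands of images.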
You will need to change most functions in DigitRecognition.hs to work with these new types. You may also need helper functions for building the corpus. Other files will not need to change. The following two definitions may help. featureGrid groups the features into rows, and is helpful for creating summaries.
featureGrid :: [[Feature]]
featureGrid = chunksOf 28 allFeatures  -- requires: import Data.List.Split (chunksOf)
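chunksOf comes from Data.List.Split (the split package, which make setup should install). If you want to see what it does, or need a drop-in stand-in, an equivalent definition is a two-liner (chunksOf' is my local name for it):

```haskell
-- A local stand-in for Data.List.Split.chunksOf.
chunksOf' :: Int -> [a] -> [[a]]
chunksOf' _ [] = []
chunksOf' n xs = take n xs : chunksOf' n (drop n xs)

-- chunksOf' 3 [1..7] == [[1,2,3],[4,5,6],[7]]
```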
getFeature takes a summary and a feature, returning the corresponding tuple of positive and negative counts.
getFeature :: Summary -> Feature -> (Int, Int)
getFeature summary ind =            -- avoid naming the argument sum: it shadows Prelude's sum
    let row = summary !! (ind `div` 28)
    in  row !! (ind `mod` 28)