Data
This tab describes the datasets we used while creating our model and the purpose of each one. Every dataset described below is either open-source or was created by us. Sources for each dataset are linked in the subsections below.
Our Own Character Handwriting Samples
We made a small database of our own handwritten samples of the alphabet and some common mathematical symbols. The following characters have more than 20 samples each: N, n, x, summation, infinity, equals sign, and plus sign. We chose those characters because, in our experience, they are common in an EECS 351 lecture. The goal was to try simple classifiers in MATLAB on a small scale at the beginning of our project. We also wanted samples with a variety of backgrounds and writing tools.
We implemented K Nearest Neighbors and a decision tree on this dataset with very minimal signal processing beforehand. The results of our initial experiment using this dataset are described in the Results tab.
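As an illustration only, K Nearest Neighbors can be sketched in a few lines. The example below is written in Python with NumPy rather than MATLAB and is not the code used for this experiment. It uses made-up two-dimensional points in place of flattened character images, and the function name knn_predict and the choice k=3 are assumptions.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=3):
    """Classify each query row by majority vote among its k nearest training rows."""
    train_x = np.asarray(train_x, dtype=float)
    query_x = np.asarray(query_x, dtype=float)
    train_y = np.asarray(train_y)
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = query_x[:, None, :] - train_x[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k closest training samples for every query.
    nearest = np.argsort(dists, axis=1)[:, :k]
    predictions = []
    for row in nearest:
        # Majority vote over the labels of the k neighbors.
        labels, counts = np.unique(train_y[row], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Hypothetical toy data standing in for flattened character images.
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
train_y = np.array(["x", "x", "x", "+", "+", "+"])
predicted = knn_predict(train_x, train_y, [[0.05, 0.1], [1.0, 0.95]], k=3)
print(predicted)
```

With real images, each row of train_x would be one image flattened into a vector of pixel values, so the distance compares images pixel by pixel.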
This is a screenshot of our original database. Here is a link to view or download the database.
Image © Team 6 original work.
EMNIST Database Training Data
The EMNIST dataset is a publicly available dataset which contains samples of handwritten digits. There are 60,000 training samples and 10,000 testing samples. It contains handwriting samples from American high school students, as well as American Census Bureau employees. The samples are normalized to fit into a 28x28 pixel box and are anti-aliased.
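To make the idea of fitting a sample into a fixed pixel box concrete, here is a rough Python sketch. It is not the procedure used to build the dataset: it crops to the ink, pads to a square, and resamples with nearest-neighbor indexing, and it does no anti-aliasing. The function name fit_to_box and the convention that ink pixels are greater than zero are assumptions.

```python
import numpy as np

def fit_to_box(image, size=28):
    """Crop to the ink bounding box, pad to a square, and resample to size x size."""
    image = np.asarray(image, dtype=float)
    rows = np.any(image > 0, axis=1)
    cols = np.any(image > 0, axis=0)
    if not rows.any():
        # Blank input: nothing to crop, return an empty box.
        return np.zeros((size, size))
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    glyph = image[r0:r1 + 1, c0:c1 + 1]
    # Pad the shorter side so the aspect ratio is preserved.
    h, w = glyph.shape
    side = max(h, w)
    square = np.zeros((side, side))
    top, left = (side - h) // 2, (side - w) // 2
    square[top:top + h, left:left + w] = glyph
    # Nearest-neighbor resampling onto the size x size grid.
    idx = np.arange(size) * side // size
    return square[np.ix_(idx, idx)]

# Hypothetical input: a tall stroke drawn on a 100x60 canvas.
canvas = np.zeros((100, 60))
canvas[10:50, 20:30] = 1.0
boxed = fit_to_box(canvas)
print(boxed.shape)
```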
We implemented K Nearest Neighbors, Multinomial Naïve Bayes, Support Vector Machine, LeNet Neural Network, and Decision Tree Classifiers on this dataset. See the Results tab for more information.
This link provides another great reference to gain a better understanding of this dataset.
Image © https://arxiv.org/pdf/1803.01900.pdf.
HASYv2 Database Training Data
The HASYv2 dataset is a publicly available dataset that contains samples of handwritten letters, digits, and mathematical symbols. Each character is a black-and-white 32x32 image. There are ~168,000 images in the dataset, each labelled with one of 369 classes.
This dataset was used to train our second neural network to recognize mathematical symbols. Rather than using all of the classes, we extracted the numbers, letters, and a small selection of other characters (∏, ∑, ∫, <, >, -, +, /, ×, ≈, [, ], ∞, 𝛕, →).
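Selecting a subset of classes can be sketched as a filter over the label file. The Python example below is an illustration, not our extraction script. It assumes a CSV label file with the columns path, symbol_id, latex, and user_id, and it assumes that symbols are named by their LaTeX commands. The sample rows, including the symbol IDs, are made up, and the keep set lists only a few of the selected characters.

```python
import csv
import io
import string

# Hypothetical excerpt in the assumed label-file layout (IDs are invented).
SAMPLE_LABELS = r"""path,symbol_id,latex,user_id
hasy-data/v2-00000.png,31,A,50
hasy-data/v2-00001.png,922,\sum,12
hasy-data/v2-00002.png,540,\heartsuit,7
hasy-data/v2-00003.png,70,1,12
"""

# Letters, digits, and a few of the selected symbols (assumed LaTeX names).
KEEP = set(string.ascii_letters) | set(string.digits) | {
    r"\sum", r"\int", r"\infty", "+", "-",
}

def select_rows(label_file, keep):
    """Return the label rows whose class name is in the keep set."""
    reader = csv.DictReader(label_file)
    return [row for row in reader if row["latex"] in keep]

rows = select_rows(io.StringIO(SAMPLE_LABELS), KEEP)
kept = [row["latex"] for row in rows]
print(kept)
```

Filtering on the label rows first means only the images for the kept classes need to be loaded for training.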
Sentence Database Training Data
For full-scale testing, we wanted a database of handwriting samples of full sentences. However, we could not find a dataset that matched our specifications: handwriting that was not in cursive, had no connected letters, and was written in thick marker on a whiteboard. We considered the IAM handwriting dataset, which contains handwriting samples from 657 writers and over 5000 samples of full sentences, but ultimately determined that it did not meet our requirements, so we did not use it.
Sentence Database Sample Images
Mathematical Expression Dataset
The CROHME dataset contains over 10,000 mathematical expressions. We were interested in using it to test our mathematical expression classification capability. However, its samples were collected with a variety of writing tools and surfaces (different digital pens, whiteboards, and tablets), so the dataset does not have a consistent scale or resolution. We therefore decided against using CROHME, because we needed a consistent source to properly test our classification in its early stages of development.
CROHME dataset sample image.
Image ©