Datasets for Document Analysis & Recognition - Handwritten Datasets (Urdu, English)

Introduction:

This handwritten dataset is presented to assist researchers working in handwriting analysis, particularly writer identification/verification, so that they can evaluate their algorithms and techniques on a common benchmark and obtain an overall measure of their performance. The database contains handwritten samples in two languages, English and Urdu. Each writer produced four samples of his or her handwriting: two in English and two in Urdu. In total, the database contains 176 scanned handwritten samples produced by 44 writers, hence 88 samples for each script. The samples are scanned at a resolution of 300 dpi and stored in ".png" format.

Structure & Ground Truth:

As far as the structure of a handwritten sample (as shown in the figure) is concerned, we have followed the IAM [1] sample layout owing to its ease and flow. Each sample document is divided into four parts. The uppermost part contains the unique form ID. The second part contains the printed text to be copied by the writer. The third part is where the writer copies the printed text in his or her natural handwriting. The fourth and final part gives the writer the option of writing down his or her name.

The ground truth is provided in the form of a unique form/document name (the image name) for writer identification/verification purposes. Each image is given a unique name according to the following convention:

[One Letter Category Code][Two Digit Form Code/Form count]-[Three Digit Writer ID][Letter U representing Urdu]

To retrieve the four handwritten samples of a writer, look for the four forms in the whole database that share the same three-digit writer ID.
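
The lookup by writer ID can be sketched in a few lines of Python. This is only an illustration and not part of the dataset distribution. It assumes that the literal "-" separates the form code from the writer ID, that the trailing "U" appears only on Urdu samples (so a name without it is treated as English), and that every file ends in ".png". The function names parse_name and group_by_writer and the example file names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed reading of the naming convention:
# one letter, two digits, "-", three digits, optional "U", then ".png".
NAME_RE = re.compile(r"^([A-Za-z])(\d{2})-(\d{3})(U?)\.png$")


def parse_name(filename):
    """Split an image name into its fields, or return None if it does not match."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    category, form, writer, urdu_flag = match.groups()
    return {
        "category": category,
        "form": form,
        "writer": writer,
        # Assumption: no trailing "U" means the sample is English.
        "script": "Urdu" if urdu_flag else "English",
    }


def group_by_writer(filenames):
    """Map each three-digit writer ID to the list of image names that carry it."""
    groups = defaultdict(list)
    for name in filenames:
        info = parse_name(name)
        if info is not None:  # skip files that do not follow the convention
            groups[info["writer"]].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the functions.
    names = ["a01-007.png", "a02-007.png", "a01-007U.png", "a02-007U.png"]
    for writer, forms in group_by_writer(names).items():
        print(writer, forms)
```

If the assumptions hold, every writer ID should map to exactly four image names, which gives a quick sanity check on a downloaded copy of the database.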