Database & Tools

Database Description

HANDS-VNOnDB (VNOnDB for short) provides 1,146 Vietnamese paragraphs of handwritten text comprising 7,296 lines, more than 480,000 strokes, and more than 380,000 characters written by 200 Vietnamese writers. Writers were asked to freely write ground-truth text drawn from a corpus of Vietnamese text. The ground-truth text is derived from the VieTreeBank corpus, which is based on Vietnamese newspapers and therefore contains all of the Vietnamese characters and some special symbols.

To collect the patterns, we used Fujitsu Tablet PCs (FMVT8170) with a stylus pen at a high sampling rate (120 Hz). Each sequence contains multiple lines with various delayed strokes. The database is therefore suitable for studying Vietnamese online handwriting recognition.

Challenges of Vietnamese Online Handwriting Recognition

The distinctive characteristic of Vietnamese script is its diacritical marks (DMs). DMs, or diacritics, are combined with the 6 base vowels (a, i, u, e, o, y) to create 66 different derivative vowels. They are written above or below a vowel. DMs and some consonant letters (Đ, đ, j, t, f) are often written with delayed strokes. A delayed stroke of a character is a stroke written after some strokes of subsequent characters have already been written.
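To make the vowel-plus-diacritic structure concrete, the following minimal sketch splits Vietnamese text into base letters and their combining marks using Unicode canonical decomposition (NFD) from the Python standard library. The function name `split_diacritics` is hypothetical and not part of the database tools. Unicode's decomposition is only one way to view the script: it also treats the circumflex, breve and horn as combining marks, so it does not necessarily match the count of derivative vowels given above or the way writers actually order their strokes.

```python
import unicodedata


def split_diacritics(text):
    """Return a list of (base_letter, [mark names]) pairs for `text`.

    Hypothetical helper: relies on Unicode NFD decomposition, which
    separates a precomposed letter such as "ệ" into a base letter
    followed by its combining marks.
    """
    result = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # A combining mark belongs to the most recent base letter.
            if result:
                result[-1][1].append(unicodedata.name(ch))
        else:
            result.append((ch, []))
    return result
```

For example, `split_diacritics("Bản")` reports a single mark (the hook above) attached to "a". Letters such as Đ and đ have no Unicode decomposition, so they are returned as plain base letters even though their crossbar is often written as a delayed stroke.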

Short-distance delayed strokes are those written after one or a few intervening strokes, while long-distance delayed strokes are those written after more than a few strokes, for example after a whole sentence. Delayed strokes are not always written for diacritical marks, and diacritical marks are not always written as delayed strokes.
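As a rough illustration of what a delayed stroke looks like in the data, the sketch below flags strokes that lie entirely to the left of the rightmost point reached by earlier strokes. This is a simple heuristic for illustration, not the method used to build or annotate the database. It assumes the strokes come from a single left-to-right line in writing order, and the function name `find_delayed_strokes` and the `margin` parameter are hypothetical.

```python
def find_delayed_strokes(strokes, margin=0.0):
    """Return indices of candidate delayed strokes.

    strokes: list of strokes in writing order; each stroke is a list
             of (x, y) points from a single left-to-right line.
    margin:  how far (in x units) a stroke must fall behind the
             rightmost ink written so far before it is flagged.
    """
    delayed = []
    rightmost = None  # largest x seen in any earlier stroke
    for index, stroke in enumerate(strokes):
        xs = [x for x, _ in stroke]
        if not xs:
            continue  # skip empty strokes
        stroke_right = max(xs)
        if rightmost is not None and stroke_right < rightmost - margin:
            # The pen moved back behind ink that was already written.
            delayed.append(index)
        rightmost = stroke_right if rightmost is None else max(rightmost, stroke_right)
    return delayed
```

A larger `margin` reduces false positives from strokes that merely overlap within one character. The number of strokes between a flagged stroke and the character it belongs to could then be used to tell short-distance from long-distance delays, but that requires knowing the stroke-to-character assignment.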

InkML Description

The structure of each InkML file in VNOnDB2018 is shown below. Each file has two main sections: a description section (including description, content_category, language, writer index, gender, age, ...) and a trajectory data section (including multiple "traceGroup" elements).

Each "traceGroup" element contains the ground-truth text in a "Tg_Truth" tag and stroke data in "trace" elements, each of which is a sequence of points given by their x- and y-coordinates.

A sample InkML file can be downloaded here.

<ink>
  <annotationXML>
    <Description>Cursive online handwriting</Description>
    <Content_Category>Text</Content_Category>
    <Language>Vietnamese</Language>
    <Writer_ID>id_xxxx</Writer_ID>
    <Gender>Male</Gender>
    <Age>22</Age>
    <Dominant_Hand>Left</Dominant_Hand>
    <Writing_Hand>Left</Writing_Hand>
    <Job>Student</Job>
    <Native_Language>Vietnamese</Native_Language>
    <Start_Time>2014-06-03T16:26:20</Start_Time>
    <DevName>FujitsuTabletPC</DevName>
    <SamplingRate>120</SamplingRate>
    <MaxNormalPressure>255</MaxNormalPressure>
    <Gt_File_Name>BCCTC</Gt_File_Name>
  </annotationXML>
  <traceGroup id="tg_0_0_0">
    <annotationXML>
      <Tg_Truth>Bản</Tg_Truth>
    </annotationXML>
    <trace id="tr_0_0"> x1 y1, x2 y2, x3 y3, ....</trace>
    <trace id="tr_0_1"> x11 y11, x12 y12, x13 y13, ....</trace>
  </traceGroup>
  ...
</ink>
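
The structure above can be read with Python's standard `xml.etree.ElementTree` module. The following is a minimal sketch, not an official tool of the database. It assumes the layout matches the sample exactly: no XML namespace on the elements, "traceGroup" elements placed directly under "ink", and each "trace" holding comma-separated "x y" pairs. If the distributed files declare an InkML namespace or nest trace groups, the tag lookups would need to be adapted. The function names `parse_trace` and `parse_inkml` are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_trace(text):
    """Convert 'x1 y1, x2 y2, ...' into a list of (x, y) float tuples."""
    points = []
    for pair in text.split(","):
        fields = pair.split()
        if len(fields) >= 2:
            # Only the first two fields are used as coordinates.
            points.append((float(fields[0]), float(fields[1])))
    return points


def parse_inkml(xml_text):
    """Return (metadata dict, list of trace groups) from InkML text.

    Assumes un-namespaced tags and trace groups directly under <ink>,
    as in the sample shown above.
    """
    root = ET.fromstring(xml_text)

    # Description section: the annotationXML that is a direct child of <ink>.
    metadata = {}
    header = root.find("annotationXML")
    if header is not None:
        metadata = {child.tag: (child.text or "").strip() for child in header}

    # Trajectory section: one entry per traceGroup.
    groups = []
    for group in root.findall("traceGroup"):
        truth = group.findtext("annotationXML/Tg_Truth", default="")
        strokes = [parse_trace(trace.text or "") for trace in group.findall("trace")]
        groups.append({"id": group.get("id"), "truth": truth, "strokes": strokes})
    return metadata, groups
```

To read a file from disk, open it with `encoding="utf-8"` and pass the contents to `parse_inkml`; each returned group then pairs its ground-truth text with a list of strokes in writing order.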

Download

For each task, training, validation, and test sets are provided. All files are in InkML format, encoded in Unicode (UTF-8).

The competition database has been uploaded to the TC-11 website (http://tc11.cvc.uab.es/datasets/).

Participants can download the database from http://tc11.cvc.uab.es/datasets/HANDS-VNOnDB2018_1/

Copyrights

The database and all tools are provided under a Creative Commons license for academic and research purposes only, not for commercial use.

For commercial and other purposes, please contact us in advance.