Towards A Dataset Agnostic Word Segmentation

Gregory Axler and Lior Wolf

Abstract

We present a flexible and general framework for word segmentation in handwritten documents, which incorporates techniques from the recent object detection literature as well as document analysis tools.

Our method utilizes information that is relevant for word segmentation and ignores other highly variable information contained in a handwritten text, thus allowing for efficient transfer learning between datasets and alleviating the need for labeled training data.

Our approach efficiently detects words in a variety of scanned document images, including both historical and modern-day handwritten documents. In addition, we demonstrate the usefulness of our approach by achieving state-of-the-art results for segmentation-free word spotting tasks.

Automatic analysis of cluttered documents

A major bottleneck in humanities research is the time required to manually sift through old manuscripts: researchers can spend several months with a single book. Reliable extraction of specific information, combined with digitally assisted document analysis, could help mitigate this bottleneck and facilitate the incorporation of quantitative methods into humanities research.

Word segmentation in documents is a critical stage towards word and character recognition, as well as word spotting. Despite recent advancements in word segmentation and object detection, detecting instances of words in a cluttered handwritten document remains a non-trivial task that requires a large number of labeled documents for training.

Segmented cluttered document

Method

Segmentation methods based on connected components suffer from several drawbacks, including the need for preprocessing and limited utilization of the available information. These issues severely limit the usability of such methods on cluttered or corrupted handwritten documents. Although deep learning based methods make much better use of the available information and are more versatile in the types of documents they can segment, bounding box proposal generation remains a bottleneck, because existing proposal generation processes are not well suited to handwritten text documents. Moreover, deep learning based methods depend on the availability of training data, which can be a limiting factor for historical manuscripts.

To address the above issues, we propose a fully convolutional neural network based method that implicitly incorporates gap classification into the proposal generation process and produces box proposals that are compatible with the structure of text documents. In addition, our method utilizes information that is relevant for word segmentation and ignores other highly variable information contained in handwritten text, thus allowing for efficient transfer learning between datasets and alleviating the need for labeled training data.
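For readers unfamiliar with gap classification, the idea can be illustrated in its classical, hand-tuned form, which is not the learned, implicit variant proposed here. The sketch below assumes a single, already binarized text line and labels each run of empty columns as an inter-word gap when it is at least `min_gap` columns wide. The function name, the input layout, and the threshold are all hypothetical choices made for illustration.

```python
def split_words_by_gaps(line, min_gap=3):
    """Return (start, end) column spans of words in a binary text line.

    line: list of rows, each a list of 0/1 ints (1 = ink).
    min_gap: hypothetical threshold; a run of empty columns at least
        this wide is classified as an inter-word gap.
    The returned spans are half-open: columns start..end-1 belong to the word.
    """
    if not line:
        return []
    width = len(line[0])
    # Vertical projection: a column has ink if any row has ink there.
    has_ink = [any(row[x] for row in line) for x in range(width)]

    words = []
    start = None     # first column of the word being built
    last_ink = None  # most recent column that contained ink
    for x, ink in enumerate(has_ink):
        if not ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The empty run since the last ink column is wide enough:
            # close the current word and start a new one.
            words.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink + 1))
    return words


if __name__ == "__main__":
    toy_line = [[1, 1, 0, 1, 0, 0, 0, 1, 1, 0]]
    print(split_words_by_gaps(toy_line, min_gap=3))
```

A fixed threshold like this depends on clean binarization and consistent spacing, which is the kind of fragility on cluttered pages that motivates folding the gap decision into a learned proposal process.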

Model Overview

ICDAR document

ICDAR Heatmap - generated by the heatmap network
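This page does not describe how the heatmap is consumed downstream. As a purely illustrative sketch, one simple way to read candidate word regions off such a heatmap is to threshold it and take the bounding box of each connected region. The function name, the threshold of 0.5, and the use of 4-connectivity are assumptions made for this example and are not taken from the paper.

```python
def heatmap_to_boxes(heatmap, threshold=0.5):
    """Return inclusive (x_min, y_min, x_max, y_max) boxes, one per
    4-connected region of cells whose value is >= threshold.

    heatmap: list of rows of floats (hypothetical per-pixel word scores).
    """
    height = len(heatmap)
    width = len(heatmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or heatmap[y][x] < threshold:
                continue
            # Flood fill from this seed, tracking the region's extent.
            stack = [(x, y)]
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while stack:
                cx, cy = stack.pop()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and heatmap[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


if __name__ == "__main__":
    toy = [
        [0.9, 0.8, 0.1, 0.0, 0.7],
        [0.6, 0.2, 0.1, 0.0, 0.9],
    ]
    print(heatmap_to_boxes(toy))
```

Boxes are returned in the order their top-left seed pixel is met during a row-by-row scan.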
