CTPN

Detecting Text in Natural Image with Connectionist Text Proposal Network

CTPN uses a convolutional neural network (e.g. VGG16) to extract features from the raw image.

It assumes text in the image is horizontal, so it only needs to predict the vertical gaps between text lines.

This is a typical recurrent-neural-network task.

A series of rows of the feature map is fed into a bidirectional LSTM, which predicts whether each position on the feature map corresponds to a gap or to text.
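A minimal sketch of this step, scanning each feature-map row in both directions and classifying every position as text or gap. The shapes, the hidden size, and the simple tanh recurrence standing in for a real LSTM cell are all illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def bidirectional_row_scores(feature_map, rng=np.random.default_rng(0)):
    """Score each feature-map position as text vs. gap.

    feature_map: (H, W, C) conv features (e.g. VGG16 conv5).
    Returns (H, W, 2) logits for the two classes text/gap.
    A toy tanh recurrence stands in for a real LSTM cell.
    """
    H, W, C = feature_map.shape
    D = 32  # hidden size (illustrative)
    Wx = rng.normal(0, 0.1, (C, D))
    Wh = rng.normal(0, 0.1, (D, D))
    Wo = rng.normal(0, 0.1, (2 * D, 2))  # text/gap classifier

    logits = np.zeros((H, W, 2))
    for r in range(H):                    # each row is one sequence
        fwd = np.zeros((W, D))
        bwd = np.zeros((W, D))
        h = np.zeros(D)
        for t in range(W):                # left-to-right pass
            h = np.tanh(feature_map[r, t] @ Wx + h @ Wh)
            fwd[t] = h
        h = np.zeros(D)
        for t in reversed(range(W)):      # right-to-left pass
            h = np.tanh(feature_map[r, t] @ Wx + h @ Wh)
            bwd[t] = h
        # both directions are concatenated before classification
        logits[r] = np.concatenate([fwd, bwd], axis=1) @ Wo
    return logits
```

The point of the bidirectional pass is that whether a position is text or gap depends on context on both sides of it.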

Every row can contain multiple text segments and gaps. Adjacent text proposals are connected into a line.
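A sketch of how adjacent proposals might be linked into lines. The function name, the gap threshold, and the vertical-overlap test are my own illustration; the paper's actual text-line construction is more involved:

```python
def link_proposals(boxes, max_gap=5, min_v_overlap=0.7):
    """Merge horizontally adjacent proposals into text lines.

    boxes: list of (x1, y1, x2, y2) proposals.
    Two proposals are linked when their horizontal gap is small
    and their vertical extents mostly overlap (illustrative rule).
    """
    boxes = sorted(boxes)          # left to right
    lines = []
    cur = list(boxes[0])
    for x1, y1, x2, y2 in boxes[1:]:
        overlap = min(cur[3], y2) - max(cur[1], y1)
        min_h = min(cur[3] - cur[1], y2 - y1)
        if x1 - cur[2] <= max_gap and overlap >= min_v_overlap * min_h:
            # extend the current line to cover this proposal
            cur[1] = min(cur[1], y1)
            cur[2] = max(cur[2], x2)
            cur[3] = max(cur[3], y2)
        else:
            lines.append(tuple(cur))
            cur = [x1, y1, x2, y2]
    lines.append(tuple(cur))
    return lines
```

For example, three 16-pixel-wide proposals where the first two touch and the third sits 8 pixels away would merge into two lines:

```python
link_proposals([(0, 0, 16, 20), (16, 0, 32, 20), (40, 0, 56, 20)])
# → [(0, 0, 32, 20), (40, 0, 56, 20)]
```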

The height of the text needs to be predicted as well. Here it uses the idea of anchors (from Faster R-CNN), which provide

k possible heights, 11, 16, ..., 273 (divided by 0.7 each time). There seems to be more regression and refinement in later

steps to bound the text more precisely. Need to read the Faster R-CNN paper to understand it better.

Although the method is designed for horizontal text, it still works if the text is rotated by a moderate angle (less than 45 degrees?).

Note that the prediction is an axis-aligned bounding box, so curved or angled text causes problems: e.g. a long line of angled text would get one big box, with the text itself running diagonally from one corner to the other and most of the box empty.
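The problem with angled text can be made concrete with a little geometry: the axis-aligned box of a rotated text line covers far more area than the text. The example numbers (a 200x20 line at 30 degrees) are mine:

```python
import math

def axis_aligned_bbox(corners):
    """Smallest axis-aligned box containing the given corner points."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)

def rotated_rect_corners(w, h, deg):
    """Corners of a w x h rectangle rotated by deg around the origin."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [(x * c - y * s, x * s + y * c)
            for x, y in [(0, 0), (w, 0), (w, h), (0, h)]]

# a 200x20 text line rotated by 30 degrees
corners = rotated_rect_corners(200, 20, 30)
x1, y1, x2, y2 = axis_aligned_bbox(corners)
box_area = (x2 - x1) * (y2 - y1)
text_area = 200 * 20
# box_area is more than 5x text_area: the box is mostly empty
```

This is why later methods predict rotated or quadrilateral boxes instead of axis-aligned ones.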