This article analyzes the impact of training data that contains un-annotated text instances, i.e., partial annotation, on scene text detection and proposes a text region refinement approach to address it. Scene text detection has attracted the attention of the research community for decades, and recent deep learning approaches have obtained impressive results in the fully supervised setting. These approaches, however, need large, completely labeled datasets, and creating such datasets is challenging and time-consuming. The research literature lacks an analysis of partially annotated training data for scene text detection. We find that the performance of a generic scene text detection method drops significantly when the training data is partially annotated. We propose a text region refinement method that provides robustness against partially annotated training data in scene text detection. The proposed method works as a two-tier scheme. In the first tier, applying a hybrid loss yields text-probable regions, from which pseudo-labels are generated to refine text regions in the second tier during training.
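The abstract does not specify the hybrid loss or the refinement rule, so the following is only a minimal sketch of the general idea, using plain confidence-thresholded pseudo-labeling as a stand-in. The function names, the threshold value, and the toy probabilities are all assumptions made for illustration and are not the authors' method.

```python
import math
from typing import List


def pseudo_labels(annotated: List[int], text_prob: List[float],
                  threshold: float = 0.8) -> List[int]:
    """Refine per-pixel labels under partial annotation.

    A pixel is kept as text if it was annotated as text, and it is promoted
    to text if the first-tier prediction is confident (>= threshold).
    The threshold is a hypothetical value chosen for this sketch.
    """
    return [1 if a == 1 or p >= threshold else 0
            for a, p in zip(annotated, text_prob)]


def bce(targets: List[int], probs: List[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over a flat list of pixels."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)


# Toy example: pixel 1 is real text that the annotator missed.
annotated = [1, 0, 0, 0]
text_prob = [0.9, 0.85, 0.1, 0.2]  # hypothetical first-tier predictions

refined = pseudo_labels(annotated, text_prob)
loss_partial = bce(annotated, text_prob)  # penalizes the missed instance
loss_refined = bce(refined, text_prob)    # no longer treats it as background
```

With the raw partial annotation, the confident prediction on the missed instance is penalized as a false positive; after refinement it is treated as text, which is the kind of robustness the abstract describes.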
Scene text appears in a wide range of sizes and arbitrary orientations. For detecting such text in a scene image, a quadrilateral bounding box fits much more tightly than a rotated rectangle. In this work, a vector regression method is proposed for text detection in the wild that generates quadrilateral bounding boxes. Bounding box prediction by direct regression requires predicting vectors from each position inside the quadrilateral. Four vectors must be predicted at each position, and each varies drastically in length and orientation, which makes vector prediction a difficult problem. To overcome this, we propose centroid-centric vector regression, which exploits the geometry of the quadrilateral. We add the philosophy of indirect regression to direct regression by shifting all points within the quadrilateral to the centroid and then performing vector regression from the shifted points.
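The difference between the two target encodings can be illustrated with a small sketch. It assumes that the four vectors point to the four corners of the quadrilateral and that the centroid is the mean of the vertices; neither detail is stated in the abstract, and all function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def centroid(quad: List[Point]) -> Point:
    """Vertex mean of the quadrilateral (an assumed definition of centroid)."""
    return (sum(p[0] for p in quad) / len(quad),
            sum(p[1] for p in quad) / len(quad))


def direct_targets(point: Point, quad: List[Point]) -> List[Point]:
    """Direct regression: vectors from a position to each corner."""
    return [(vx - point[0], vy - point[1]) for vx, vy in quad]


def centroid_centric_targets(point: Point, quad: List[Point]):
    """Shift the position to the centroid, then regress corners from there."""
    c = centroid(quad)
    shift = (c[0] - point[0], c[1] - point[1])
    corner_vectors = direct_targets(c, quad)  # same for every inside point
    return shift, corner_vectors


def decode(point: Point, shift: Point, corner_vectors: List[Point]) -> List[Point]:
    """Recover the quadrilateral from a position and its predicted targets."""
    cx, cy = point[0] + shift[0], point[1] + shift[1]
    return [(cx + dx, cy + dy) for dx, dy in corner_vectors]


quad = [(0, 0), (4, 0), (4, 2), (0, 2)]
p1, p2 = (1.0, 0.5), (3.0, 1.5)

shift1, corners1 = centroid_centric_targets(p1, quad)
shift2, corners2 = centroid_centric_targets(p2, quad)
```

In this encoding only the shift depends on the position; the four corner vectors are identical for every point inside the quadrilateral, whereas `direct_targets` yields a different set of four vectors at each position.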
This paper proposes a Convolutional Neural Network (CNN) based Text Region Proposal Network (TRPN) for generating word-level region proposals. The proposed architecture can be trained on a low-memory GPU and achieves good recall with a limited number of region proposals, which reduces the overhead on the subsequent detection and recognition tasks. To make the architecture trainable on a low-memory GPU, its number of parameters is reduced by first decreasing the number of kernels and then increasing them. In-network fusion is used to recover the localization accuracy that is reduced by the max-pooling operations: it fuses feature maps from different levels to localize text of different aspect ratios and sizes efficiently. The proposed architecture achieves competitive recall with a few tens of region proposals, fewer than state-of-the-art region proposal methods for text require.
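To see why decreasing the number of kernels before increasing them lowers the parameter count, consider the following rough sketch. The channel schedules, the 3x3 kernel size, and the function names are hypothetical values chosen for illustration; the abstract does not give the actual TRPN configuration.

```python
from typing import List


def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights plus biases of one k x k convolution layer."""
    return k * k * c_in * c_out + c_out


def total_params(channels: List[int], in_channels: int = 3, k: int = 3) -> int:
    """Parameter count of a plain stack of convolutions."""
    total, prev = 0, in_channels
    for c in channels:
        total += conv_params(prev, c, k)
        prev = c
    return total


# Hypothetical schedules, not the published architecture.
growing = [64, 128, 256, 512]  # kernels increase at every stage
reduced = [64, 32, 64, 128]    # kernels decrease first, then increase

growing_count = total_params(growing)
reduced_count = total_params(reduced)
```

Because the cost of a layer scales with the product of its input and output channels, narrowing the early layers shrinks every later layer that builds on them, which is why the second schedule needs far fewer parameters than the first.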