TransDETR: End-to-end Video Text Spotting with Transformer

Weijia Wu, Chunhua Shen, Yuanqiang Cai, Debing Zhang, Ying Fu, Ping Luo, Hong Zhou

Abstract

Recent video text spotting methods usually require a three-stage pipeline, i.e., detecting text in individual images, recognizing the localized text, and tracking text streams with post-processing to generate the final results. These methods typically follow the tracking-by-matching paradigm and develop sophisticated pipelines. In this paper, rooted in Transformer sequence modeling, we propose a simple but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR). TransDETR offers two main advantages: 1) Unlike the explicit matching paradigm between adjacent frames, TransDETR tracks and recognizes each text instance implicitly with a dedicated query, termed a 'text query', over a long-range temporal sequence (more than 7 frames). 2) TransDETR is the first end-to-end trainable video text spotting framework, which simultaneously addresses the three sub-tasks (i.e., text detection, tracking, and recognition). Extensive experiments on four video text datasets (i.e., ICDAR2013 Video, ICDAR2015 Video, Minetto, and YouTube Video Text) demonstrate that TransDETR achieves state-of-the-art performance, with up to around 8.0% improvement on video text spotting tasks.


Pipeline Comparison

TransDETR Architecture

The TransDETR architecture contains three main components: 1) a backbone (e.g., ResNet, PVT [40]) extracts feature representations of the video frame sequence; 2) a Transformer encoder models the relations among pixel-level features, and a weight-shared Transformer decoder models each arbitrary-oriented text trajectory with one text query. For the initial frame F1, an empty query set (yellow box) is fed into the decoder network to localize the initial text instances and generate the text queries for the next frame (F2). For subsequent frames Ft, the text queries from the previous frame are concatenated with an empty query set to form the text queries for the current frame; 3) an attention-based recognition head with Rotated RoI produces the final text transcription. A minimal sketch of the query-propagation scheme follows below.
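To make the query-propagation scheme concrete, here is a minimal PyTorch-style sketch. This is not the authors' implementation: the module structure, the `num_empty_queries` count, the 5-parameter oriented-box head, and the keep-threshold are illustrative assumptions; only the core idea, concatenating surviving text queries with a learned empty query set before decoding each frame, comes from the description above.

```python
import torch
import torch.nn as nn

class QueryPropagationSketch(nn.Module):
    """Illustrative sketch of TransDETR-style text-query propagation.

    Simplified assumption, not the released code: a weight-shared decoder
    is applied frame by frame, and queries whose confidence survives a
    threshold are carried over as 'text queries' for the next frame.
    """

    def __init__(self, dim=256, num_empty_queries=100, keep_thresh=0.5):
        super().__init__()
        # Learned empty queries, re-injected every frame to catch newly appearing text.
        self.empty_queries = nn.Parameter(torch.zeros(num_empty_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)  # shared across frames
        self.score_head = nn.Linear(dim, 1)  # text / no-text confidence
        self.box_head = nn.Linear(dim, 5)    # (cx, cy, w, h, angle) for oriented text
        self.keep_thresh = keep_thresh

    def forward(self, frame_memories):
        """frame_memories: list of encoder outputs, one (1, HW, dim) tensor per frame."""
        dim = self.empty_queries.size(1)
        tracked = self.empty_queries.new_zeros((0, dim))  # no tracked text at frame F1
        trajectories = []
        for memory in frame_memories:
            # Concatenate queries propagated from the previous frame with the empty set.
            queries = torch.cat([tracked, self.empty_queries], dim=0).unsqueeze(0)
            hs = self.decoder(queries, memory)                     # (1, Q, dim)
            scores = self.score_head(hs).sigmoid().squeeze(-1).squeeze(0)
            boxes = self.box_head(hs).squeeze(0)
            keep = scores > self.keep_thresh                       # queries that matched text
            tracked = hs.squeeze(0)[keep]                          # become next frame's text queries
            trajectories.append((boxes[keep], scores[keep]))
        return trajectories
```

Because a text query keeps its identity as long as it survives the threshold, tracking falls out of the decoding loop with no explicit frame-to-frame matching step; the empty queries exist solely to pick up text instances that enter the scene.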


Demo

BibTex

@article{wu2022transdetr,
  title={End-to-End Video Text Spotting with Transformer},
  author={Wu, Weijia and Shen, Chunhua and Cai, Yuanqiang and Zhang, Debing and Fu, Ying and Luo, Ping and Zhou, Hong},
  journal={arXiv preprint},
  year={2022}
}