VideoLanguage - Baselines

Algorithms

Rooted in Transformer sequence modeling, we propose a novel video text DEtection, Tracking, and Recognition framework (TransDETR), which treats the video text spotting (VTS) task as a direct long-sequence temporal modeling problem.

TransDETR offers two main advantages:

1) Unlike the explicit matching paradigm between adjacent frames, TransDETR tracks and recognizes text over a long-range temporal sequence (more than 7 frames).

2) TransDETR is an end-to-end trainable video text spotting framework that addresses the three sub-tasks (text detection, tracking, and recognition) simultaneously.
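For contrast, the explicit adjacent-frame matching paradigm mentioned in point 1 can be sketched as greedy IoU association between consecutive frames. This is a minimal toy sketch, not TransDETR's code: it assumes axis-aligned boxes (scene text is often rotated) and IoU as the only matching cue (real trackers often add learned appearance features), and the names `iou` and `link_adjacent_frames` and the 0.5 threshold are hypothetical.

```python
from typing import List, Tuple

# (x1, y1, x2, y2); axis-aligned boxes are a simplifying assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_adjacent_frames(frames: List[List[Box]], thresh: float = 0.5) -> List[List[int]]:
    """Assign a track ID to every box by greedily matching each frame
    only against the frame immediately before it (hypothetical helper)."""
    next_id = 0
    ids_per_frame: List[List[int]] = []
    prev_boxes: List[Box] = []
    prev_ids: List[int] = []
    for boxes in frames:
        # Score every (previous box, current box) pair, highest IoU first.
        pairs = sorted(
            ((iou(p, c), i, j) for i, p in enumerate(prev_boxes) for j, c in enumerate(boxes)),
            reverse=True,
        )
        cur_ids = [-1] * len(boxes)
        used_prev = set()
        for score, i, j in pairs:
            if score < thresh:
                break  # remaining pairs overlap too little to link
            if i in used_prev or cur_ids[j] != -1:
                continue  # each box may be matched at most once
            cur_ids[j] = prev_ids[i]
            used_prev.add(i)
        # Boxes with no match in the previous frame start new tracks.
        for j in range(len(boxes)):
            if cur_ids[j] == -1:
                cur_ids[j] = next_id
                next_id += 1
        ids_per_frame.append(cur_ids)
        # Only the immediately preceding frame is remembered.
        prev_boxes, prev_ids = boxes, cur_ids
    return ids_per_frame


if __name__ == "__main__":
    # One text box drifting right; the third frame has a missed detection.
    frames = [[(0, 0, 10, 10)], [(1, 0, 11, 10)], [], [(2, 0, 12, 10)]]
    print(link_adjacent_frames(frames))  # [[0], [0], [], [1]]
```

In this toy run, a single missed detection in the third frame breaks the chain, and the same text region returns with a new ID. Fragility of this kind, caused by looking only one frame back, is one motivation for modeling tracking over a longer temporal sequence.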

github.com/weijiawu/TransVTSpotter

A good baseline for video text spotting.
