Video-Language

VideoLanguage is an open-source collection of datasets and code for video-and-language tasks from our team.

Supported Datasets and Algorithms

The project includes the following algorithms:

  • TransDETR: End-to-end Video Text Spotting with Transformer

  • Contrastive Learning of Semantic and Visual Representations for Text Tracking

The project includes the following benchmarks:

  • BOVText: A Large-Scale, Bilingual Open World Dataset for Video Text Spotting

  • ViTVR: A Large-Scale Video Retrieval Benchmark with Vision and Text Aggregation

Datasets

BOVText (Bilingual, Open World Video Text) is the first large-scale, bilingual benchmark for video text spotting across a wide variety of scenarios. All data are collected from KuaiShou and YouTube.

BOVText has three main features:

  • Large-Scale: we provide 2,000+ videos with more than 1,750,000 frames, four times larger than the previously largest dataset for text in videos.

  • Open Scenario: BOVText covers 30+ open categories spanning a wide variety of scenarios, e.g., life vlogs, sports news, autonomous driving, cartoons, etc. In addition, caption text and scene text are tagged separately, as they carry different representational meanings in a video.

  • Bilingual: BOVText provides bilingual text annotations to promote communication and exchange across cultures.
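Because caption text and scene text are tagged separately, downstream code typically needs to split a frame's text instances by category. Below is a minimal sketch of that step; the field names (`category`, `transcription`, `points`) and the toy records are illustrative assumptions, not the released BOVText schema.

```python
# Hypothetical sketch: partition per-frame text annotations into caption
# text and scene text. Field names here are assumptions for illustration,
# not the actual BOVText annotation format.

def split_by_category(annotations):
    """Return (caption_text, scene_text) lists from a frame's annotations."""
    captions = [a for a in annotations if a["category"] == "caption"]
    scene = [a for a in annotations if a["category"] == "scene"]
    return captions, scene

# Toy frame with one caption instance and one scene-text instance.
frame_annotations = [
    {"transcription": "Breaking News", "category": "caption",
     "points": [[10, 10], [200, 10], [200, 40], [10, 40]]},
    {"transcription": "EXIT", "category": "scene",
     "points": [[300, 120], [340, 120], [340, 150], [300, 150]]},
]

captions, scene = split_by_category(frame_annotations)
print(len(captions), len(scene))  # 1 1
```

Consult the dataset's annotation files for the actual field names before adapting this sketch.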

Questions?

Contact [email] for more information about the project.