We introduce a Chinese vision-language pretrained model for cross-modal understanding, visual embedding, and multimodal retrieval. The model has 0.6B parameters and is trained on a dataset of 0.37B image-text pairs.
The corresponding technology is applied across nearly all businesses on the content platform, including video audit, video tagging and standardization, video retrieval, and video fingerprinting.
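Vision-language pretraining of this kind is typically driven by a symmetric contrastive (InfoNCE) objective over paired image and text embeddings. The sketch below is illustrative only, assuming a CLIP-style setup; the function name and temperature value are our own, not details of the actual model:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    labels = np.arange(len(logits))          # matched pairs lie on the diagonal
    def xent(l):
        # Cross-entropy against the diagonal, with max-subtraction for stability.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2
```

When every image embedding equals its paired text embedding, the loss approaches zero; for random embeddings it stays strictly positive, which is the signal the encoders are trained to reduce.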
We introduce a Shrinking Temporal Attention Transformer, which achieves state-of-the-art results on multiple action recognition benchmarks, including Kinetics-400 and Something-Something V2, outperforming prior methods with 50% fewer FLOPs and without any pretrained model.
The corresponding technology is applied in WeSee, QQ Browser, and WeChat Channels, and won two second places in the TAAC competition.
We have migrated the CLIP model to video and ranked first in video-to-text and text-to-video cross-modal retrieval on multiple datasets, including MSR-VTT, MSVD, and VATEX.
The corresponding technology is applied in video material retrieval and intelligent editing of Tencent Zenvideo and Tencent Video.
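The simplest way to carry a CLIP image encoder over to video, and a common baseline for this transfer, is to mean-pool the per-frame embeddings into one video embedding and rank videos by cosine similarity to the text embedding. This is a minimal sketch under that assumption; the function names are illustrative and the production system may aggregate frames differently:

```python
import numpy as np

def video_embedding(frame_embs):
    """Aggregate per-frame CLIP image embeddings (T, D) into one video
    embedding by mean pooling, then re-normalize to unit length."""
    v = frame_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def rank_videos(text_emb, video_embs):
    """Return video indices sorted by descending cosine similarity
    to the text query embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    sims = video_embs @ t        # rows of video_embs are unit vectors
    return np.argsort(-sims)
```

Text-to-video retrieval then reduces to a single matrix-vector product over precomputed video embeddings, which is what makes this approach practical for large material libraries.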
We propose a set of innovative designs to tackle the problem of practical stereo matching. Our results not only rank 1st on both the Middlebury and ETH3D benchmarks, outperforming existing state-of-the-art methods by a notable margin, but also exhibit high-quality details on real-life photos.
The corresponding technology is applied in stereo bokeh in Megvii.
We propose a novel Patchmatch-based framework for high-resolution optical flow estimation. Our method shows strong cross-dataset generalization, achieving the best published result on KITTI 2015. It also preserves fine details on the high-resolution DAVIS dataset while consuming 2× less memory than RAFT.
The corresponding technology is applied in low-light shooting and multi-camera smoothing in Megvii.
We propose a Pyramid Attention Network (PAN) that exploits global contextual information for semantic segmentation. It achieves state-of-the-art performance on the PASCAL VOC 2012 and Cityscapes benchmarks.
The corresponding technology is applied in video segmentation and video bokeh in Megvii.
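The core idea behind this family of designs is to let a globally pooled high-level feature gate the low-level feature before fusion, so fine-grained predictions are conditioned on global context. The sketch below is a heavily simplified, assumed version of such a global-attention fusion step (real PAN modules also involve convolutions and upsampling, which are omitted here):

```python
import numpy as np

def global_attention_fuse(low_feat, high_feat):
    """Gate a low-level feature map by global context from a high-level map.
    low_feat, high_feat: (C, H, W) arrays of the same shape."""
    # Global average pooling summarizes each channel into one context value.
    gap = high_feat.mean(axis=(1, 2), keepdims=True)   # (C, 1, 1)
    # A sigmoid turns the context into per-channel attention weights in (0, 1).
    weights = 1.0 / (1.0 + np.exp(-gap))
    # Re-weight the low-level detail, then add the high-level semantics back.
    return low_feat * weights + high_feat
```

The point of the design is that per-pixel detail features are no longer fused blindly: channels that the global context deems irrelevant are suppressed before they reach the decoder.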
We have designed a real-time HDR algorithm and a RAW-based denoising algorithm, achieving 9× fewer FLOPs, 4× fewer parameters, and 3× faster inference than existing methods while providing comparable accuracy.
The corresponding technology is applied in low-light shooting and multi-camera smoothing in Megvii.
We are the first to introduce a deep-learning-based approach to remove distortion artifacts from freely shot photos. Our approach significantly outperforms the previous state-of-the-art approach both qualitatively and quantitatively.
The corresponding technology is applied in portrait correction in Megvii.
We propose a content-aware layout generation network that takes glyph images and their corresponding text as input and automatically synthesizes aesthetic layouts for them.
The corresponding technology is applied in poster generation and logo generation in Tencent Video.
We adopt a part heatmap regression network that predicts landmarks at a local granularity by generating a heatmap for each 3D landmark point; this approach won first place in 300VW.
The corresponding technology is applied in 3D avatar in Megvii.
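In heatmap-based landmark regression, each predicted heatmap still has to be decoded back into a coordinate. A common decoding choice (shown here as an illustrative sketch, not necessarily the decoder used above) is the soft-argmax: a softmax over the map followed by a probability-weighted average of pixel coordinates, which yields sub-pixel precision and stays differentiable:

```python
import numpy as np

def decode_heatmap(heatmap):
    """Decode one landmark's (x, y) position from its heatmap via soft-argmax."""
    h, w = heatmap.shape
    # Softmax over all pixels (max-subtraction for numerical stability).
    p = np.exp(heatmap - heatmap.max())
    p /= p.sum()
    # Coordinate grids: ys varies along rows, xs along columns.
    ys, xs = np.mgrid[0:h, 0:w]
    # Expected coordinate under the softmax distribution.
    return float((p * xs).sum()), float((p * ys).sum())
```

Running one such decode per heatmap turns the network's stack of per-landmark maps into the final set of landmark coordinates.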