Towards models that can read and reason about scene text
Amanpreet Singh (Facebook AI Research)
Abstract: Scene text is ubiquitous: it surrounds us and is part of our daily lives, yet it plays little role in current deep learning pipelines. In this talk, I will describe the steps my collaborators and I have taken towards addressing this problem, specifically (i) TextVQA, a visual question answering dataset whose questions require models to read and reason about the scene text present in images in order to answer them; (ii) TextCaps, a dataset for image captioning with reading comprehension; and (iii) TextOCR, a large-scale dataset with ~1M scene text annotations that helps bridge the annotation gap on real images. Finally, we will cover state-of-the-art techniques on these datasets to showcase how scene text can improve the overall quality and usability of deep learning pipelines.
Bio: Amanpreet Singh is currently working at Facebook AI Research. His research interests include vision-and-language reasoning and natural language understanding. He has worked on the Adversarial VQA, TextVQA, TextCaps, TextOCR, and Hateful Memes datasets, pushing forward the quality of datasets available in the vision-and-language field. He has also been involved in building the GLUE, SuperGLUE, and Dynabench leaderboards for natural language understanding. He is the creator of the MMF framework for vision-and-language research and has also worked on improving vision-and-language systems via self-supervised pretraining and the addition of scene text reading and reasoning capabilities to existing models.
Understanding Data Visualizations via Question Answering
Brian Price (Adobe Research Labs)
Abstract: Data visualizations such as bar charts and pie charts are effective ways to convey numerical information in visual form. Until recently, computer algorithms could not parse such visualizations. In this talk, I will review our early work on understanding and extracting information from charts via question answering. In this work, we introduced a new dataset for data visualization question answering that includes open-ended questions and requires character recognition of out-of-vocabulary words. We also proposed three models for addressing this problem, with the final proposed method achieving superhuman results on the existing datasets.
Bio: Brian Price is a Senior Research Scientist at Adobe Research. His research interests encompass computer vision, graphics, and machine learning, with a specific focus on selection and segmentation (interactive and automatic, image and video, binary and matting), document understanding, and image synthesis.
He has developed technologies that have shipped in many Adobe products, including the Object Selection and Select and Mask tools in Photoshop, the Smart Selection and Auto Select tools in Photoshop Mix, and the Refine Edge and Key Cleaner features in After Effects. Recently, he has worked with the Acrobat team on document-related technologies. He received his PhD in computer science from Brigham Young University in 2010, where he was advised by Dr. Bryan Morse.
Scene Text-Aware Pre-training for Text-VQA and Text-Caption
Yijuan Lu (Microsoft Azure AI)
Abstract: In this talk, Lu will introduce Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks, which aims at reading and understanding scene text in images for question answering and image caption generation, respectively. In contrast to conventional vision-language pre-training, which fails to capture scene text and its relationship with the visual and text modalities, TAP explicitly incorporates scene text (generated from OCR engines) during pre-training. With three pre-training tasks, namely masked language modeling (MLM), image-text (contrastive) matching (ITM), and relative (spatial) position prediction (RPP), pre-training with scene text effectively helps the model learn a better aligned representation among the three modalities: text word, visual object, and scene text. Thanks to this aligned representation learning, even when pre-trained on the same downstream task dataset, TAP boosts absolute accuracy on the TextVQA dataset by +5.4% compared to a non-TAP baseline. To further improve performance, we build a large-scale scene text dataset based on the Conceptual Captions dataset, named OCR-CC, which contains 1.4 million images with OCR scene text. Pre-trained on this OCR-CC dataset, our approach achieves the new state of the art and outperforms previous methods by a large margin on multiple tasks, i.e., +8.3% accuracy on TextVQA, +9.9% accuracy on ST-VQA, and +10.2 CIDEr score on TextCaps.
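For intuition, the sketch below shows how the three pre-training objectives described above (MLM, ITM, RPP) might be combined into a single joint loss. It is a minimal, hypothetical PyTorch sketch, not the implementation presented in the talk: the fusion encoder, head sizes, and the way region pairs are encoded for relative position prediction are all illustrative assumptions.

```python
# Hypothetical sketch of TAP-style joint pre-training objectives.
# All module shapes and the pairing scheme for RPP are assumptions for illustration.
import torch
import torch.nn as nn

class TAPSketch(nn.Module):
    def __init__(self, hidden=768, vocab=30522, num_spatial_relations=12):
        super().__init__()
        # Stand-in for the multimodal transformer that fuses question words,
        # visual-object features, and OCR-token features into one sequence.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        self.mlm_head = nn.Linear(hidden, vocab)                    # masked language modeling
        self.itm_head = nn.Linear(hidden, 2)                        # image-text matching (match / mismatch)
        self.rpp_head = nn.Linear(hidden, num_spatial_relations)    # relative (spatial) position prediction

    def forward(self, fused_inputs, mlm_labels, itm_labels, rpp_pairs, rpp_labels):
        # fused_inputs: (B, L, hidden) already-embedded multimodal sequence
        h = self.fusion(fused_inputs)
        ce = nn.CrossEntropyLoss(ignore_index=-100)
        # MLM: predict masked tokens at every position (unmasked positions labeled -100).
        loss_mlm = ce(self.mlm_head(h).transpose(1, 2), mlm_labels)
        # ITM: classify whether text and image belong together, from the first (pooled) token.
        loss_itm = ce(self.itm_head(h[:, 0]), itm_labels)
        # RPP: classify the spatial relation between pairs of object/OCR regions.
        batch_idx = torch.arange(h.size(0)).unsqueeze(1)            # (B, 1)
        pair_feats = h[batch_idx, rpp_pairs[..., 0]] + h[batch_idx, rpp_pairs[..., 1]]  # (B, P, hidden)
        loss_rpp = ce(self.rpp_head(pair_feats).transpose(1, 2), rpp_labels)
        return loss_mlm + loss_itm + loss_rpp                       # joint pre-training loss
```

The key point the sketch conveys is that scene-text tokens sit in the same fused sequence as question words and visual objects, so all three objectives push the encoder toward an aligned three-modality representation.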
Bio: Yijuan (Lucy) Lu is a Principal Scientist at Microsoft Azure AI, where she has worked on invoice understanding, the OCR core engine, and video understanding in recent years. Her recent work on Text-VQA and Text-Caption won first place in the Text-Caption challenge of the Visual Question Answering workshop at CVPR 2021. Prior to joining Microsoft, she was an associate professor in the Department of Computer Science at Texas State University. Her major publications appear in leading multimedia and computer vision venues. She was a first-place winner in challenging retrieval competitions at Eurographics for many years. She received the 2015 Texas State Presidential Distinction Award and the 2014 College Achievement Award, and she received Best Paper awards at ICME 2013 and ICIMCS 2012. She has obtained many competitive external grants from the NSF, the US Army, the US Department of Defense, and the Texas Department of Transportation.