ICDAR 2019 Tutorial on

Vision and Language: the text modality in computer vision

Overview

The ability to properly exploit textual information in an image, or about the image being analysed, is still missing from many computer vision systems.

Document image analysis has long strived to create intelligent reading systems, focusing exclusively on understanding textual and graphical information that is presented in image form.

Computer vision at large, on the other hand, shows an increasing trend towards exploiting multimodal information in various ways. Translating from one modality to the other and deriving a joint embedding between modalities are the two key paradigms. Text is frequently one of the modalities of interest, although this rarely refers to text in image form.

In this tutorial we draw from recent advances in document analysis and computer vision to showcase how text as a modality is handled in state-of-the-art research. We will review various methods and applications, focusing on deep learning techniques for multimodal embedding and cross-modal translation, which provide very powerful frameworks for modeling correlations between textual and visual information.
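To make the joint-embedding paradigm concrete before listing the applications, here is a minimal PyTorch sketch, an illustrative assumption on our part rather than any specific model covered in the tutorial: a small CNN embeds word images and a small MLP embeds their transcriptions into a shared space, trained with a triplet loss so that an image lands closer to its own transcription than to an unrelated one. The character-bigram descriptor and all architecture sizes are hypothetical stand-ins for real text embeddings such as PHOC.

```python
# Minimal sketch of the joint-embedding paradigm (illustrative only):
# word images and character strings are mapped into a shared space and
# trained with a triplet margin loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

CHARSET = "abcdefghijklmnopqrstuvwxyz"

def text_to_ngram_vector(word, n=2):
    """Hypothetical text descriptor: a bag of character bigrams."""
    vec = torch.zeros(len(CHARSET) ** n)
    w = [c for c in word.lower() if c in CHARSET]
    for i in range(len(w) - n + 1):
        idx = 0
        for c in w[i:i + n]:
            idx = idx * len(CHARSET) + CHARSET.index(c)
        vec[idx] += 1.0
    return vec

class ImageEncoder(nn.Module):
    """Tiny CNN that embeds a 1x32x100 word image into the joint space."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, dim)

    def forward(self, x):
        return F.normalize(self.fc(self.conv(x).flatten(1)), dim=-1)

class TextEncoder(nn.Module):
    """MLP that embeds the bigram descriptor into the same space."""
    def __init__(self, in_dim=len(CHARSET) ** 2, dim=128):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                nn.Linear(256, dim))

    def forward(self, x):
        return F.normalize(self.fc(x), dim=-1)

# One (entirely synthetic) training step with a triplet margin loss.
img_enc, txt_enc = ImageEncoder(), TextEncoder()
images = torch.randn(4, 1, 32, 100)  # stand-in word images
pos = torch.stack([text_to_ngram_vector(w) for w in
                   ["hello", "world", "vision", "text"]])
neg = torch.stack([text_to_ngram_vector(w) for w in
                   ["other", "words", "nearby", "noise"]])
loss = F.triplet_margin_loss(img_enc(images), txt_enc(pos), txt_enc(neg))
loss.backward()  # gradients flow into both encoders
```

Embedding-based word spotting and cross-modal retrieval methods generally follow this recipe, with stronger encoders and more careful sampling of negative examples.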

Some examples of applications that will be covered in this tutorial include:

Word spotting, where the objective is to model the correlation between the visual (image) and textual (transcription) representations of a string.

Dynamic lexicon generation, where the objective is to dynamically generate a dictionary of words that have a high probability of appearing in the image, by exploiting the visual information of the scene, as a means to facilitate subsequent scene text recognition.

Self-supervised learning of visual features, where using one modality (the text) as the supervisory signal for the other (the image) offers a mechanism for learning useful features while avoiding costly annotations.

Cross-modal / multi-modal semantic retrieval of images, where the objective is to model the correlation between the visual information and the semantics derived from textual information, so that images can be retrieved across modalities.

Image captioning, where the objective is to translate from the visual domain to the textual domain (natural language). An interesting twist on existing pipelines that we will discuss in this tutorial is how textual information in the image, or about the image to be described, can be integrated into the captioning process; a minimal sketch of the underlying encoder-decoder scheme follows this list.
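For the cross-modal translation paradigm, the sketch below shows the classic encoder-decoder captioning scheme in PyTorch: a CNN encodes the image into a feature that conditions an LSTM decoder trained with teacher forcing. The architecture, vocabulary size, and synthetic batch are illustrative assumptions, and the sketch deliberately omits the integration of scene text into the caption, which is precisely the extension discussed in the tutorial.

```python
# Minimal sketch of cross-modal translation (illustrative only): a CNN
# encodes the image and an LSTM decodes a caption token by token.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.cnn = nn.Sequential(  # stand-in image encoder
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim))
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Prepend the image feature as the first "token" of the sequence,
        # then let the LSTM predict each next word of the caption.
        img_feat = self.cnn(images).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)               # (B, T, E)
        seq = torch.cat([img_feat, words], dim=1)  # (B, T+1, E)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                    # (B, T+1, V)

# One synthetic teacher-forcing step: position t predicts caption word t
# (the final prediction is dropped for simplicity).
model = CaptionModel()
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 1000, (2, 12))
logits = model(images, captions)[:, :-1]           # (B, T, V)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
loss.backward()
```

At inference time the decoder would instead be unrolled greedily or with beam search, feeding each predicted word back in as the next input.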

The tutorial will start and finish in the document image analysis domain, but it is intended to take the audience on a tour through other research areas and applications, highlighting ideas that can be extrapolated and adapted to document image analysis, and also areas where document image analysis can play a crucial role for computer vision at large.

Time / Place

September 21, 2019

University of Technology Sydney (UTS)

Room: CB11.04.102


Expected Audience

The intended audience is researchers interested in tasks that require modeling the correlation between textual and visual information, which in principle covers the entire audience of the ICDAR conference.

The material will be suitable for all levels of researchers, from PhD students and postdocs to senior researchers. Some basic knowledge of common computer vision, natural language processing, and machine learning techniques is highly encouraged, but is not necessary to grasp the main messages of the tutorial.

Similarly, basic knowledge about deep learning, particularly convolutional neural networks (CNNs) and recurrent neural networks such as long short-term memory networks (LSTMs), is encouraged but not necessary.

Program

09:00 - 09:15 Introduction to multi-modal learning [PDF]

09:15 - 09:45 Semantic text embeddings [PDF]

09:45 - 10:30 Joint image-text embeddings for word spotting [PDF]

10:30 - 11:00 Coffee Break

11:00 - 11:30 Semantic image embeddings [PDF]

11:30 - 12:00 Cross-modal and multi-modal image retrieval [PDF]

12:00 - 12:15 Multi-modal image representations for classification [PDF]

12:15 - 12:30 Future directions: Text in Computer Vision [PDF]

People

Dimosthenis Karatzas is an associate professor at the Universitat Autònoma de Barcelona and associate director of the Computer Vision Centre (CVC) in Barcelona, Spain. At the CVC he leads the vision and language research line, working at the intersection of computer vision and text analysis. He has co-authored over 100 publications in refereed journals and conferences and has an H-index of 23.

He was the recipient of the 2013 IAPR/ICDAR Young Investigator Award and of a Google Faculty Research Award in 2017. D. Karatzas has served in various roles at major conferences in his field (ICDAR, DAS, CBDAR, ICPR, ICFHR), including co-chairing IWRR 2014/16/18 and CBDAR 2015/17. D. Karatzas is a lead organiser of the Robust Reading Competitions series.

He is the chair of Technical Committee 11 on Reading Systems of the International Association for Pattern Recognition (IAPR). D. Karatzas was a founding member and a member of the executive committee of the UK Chapter of the SPIE, and he is currently a member of the IAPR Education Committee and a member of both the IEEE and the IAPR. He is one of the founders of the Library Living Lab, an open participatory innovation space in a public library.

Marçal Rusiñol is an Associate Researcher at the Computer Vision Center within the Intelligent Reading Systems research group, where he is the PI of several competitive research and technology transfer projects. He received his B.Sc. and M.Sc. degrees in Computer Science from the Universitat Autònoma de Barcelona (UAB), Barcelona, Spain, in 2004 and 2006, respectively. In 2004 he joined the Computer Vision Center, where he obtained his Ph.D. degree in 2009 under the supervision of Dr. Josep Lladós. He has been a Teaching Assistant and an Adjunct Lecturer at the Computer Science Department of the Universitat Autònoma de Barcelona since 2005. He held two postdoctoral Marie Curie fellowships, at ITESOFT and at the L3i Lab of the Université de La Rochelle (France), respectively. His main research interests include Computer Vision, Machine Learning, Data Science, Information Retrieval, and Performance Evaluation.

Lluís Gómez i Bigordà is a TECNIOspring Research Fellow (H2020 Marie Skłodowska-Curie actions of the European Union) at the Computer Vision Center (CVC), Universitat Autònoma de Barcelona (UAB). He received his PhD in Computer Science from the Universitat Autònoma de Barcelona in 2016. As a member of the Robust Reading research team at the Computer Vision Centre and of the document analysis community, he has contributed several papers to the field and has collaborated with a variety of research groups and venues. He has collaborated with other prominent research groups in the organization of the ICDAR Robust Reading Competition in its 2013, 2015, and 2017 editions. He served as an area chair of the International Conference on Document Analysis and Recognition (ICDAR 2017), as a chair and organizer of the International Workshop on Camera Based Document Analysis and Recognition (CBDAR 2017) and the International Workshop on Robust Reading (IWRR 2018), and as a member of the Program Committee of CBDAR 2015, IWRR 2014, IWRR 2016, and DAS 2018. In 2016 he co-organized a tutorial on "Scene-Text Localization, Recognition, and Understanding" at the International Workshop on Document Analysis Systems (DAS 2016).

Raúl Gómez is a researcher in the multimedia technologies group at Eurecat. He is pursuing his industrial PhD with the Universitat Autònoma de Barcelona, advised by Dr. Dimosthenis Karatzas and Dr. Lluís Gómez (Computer Vision Center, UAB) and Dr. Jaume Gibert (Eurecat). He received his BS degree in Telecommunications Engineering from the Universitat Politècnica de Catalunya (UPC) and his MS degree in Computer Vision from the Universitat Autònoma de Barcelona (UAB). He works on scene interpretation using images and associated text, with a special focus on learning from web and social media data.

Yash Patel is currently pursuing a Master of Science in Computer Vision at the Robotics Institute of Carnegie Mellon University, USA, where he works with Professor Abhinav Gupta. He holds a B.Tech in Computer Science with Honours by Research from IIIT Hyderabad, where he worked in the fields of Computer Vision (CV) and Machine Learning (ML) under the supervision of Professor C.V. Jawahar. He has worked as an Applied Scientist Intern at Amazon-A9 with Professor Alex Smola and Professor R. Manmatha, as a Research Intern at the Center for Machine Perception, Prague, under the supervision of Professor Jiri Matas, and as a Research Intern at the Computer Vision Center, Barcelona, under the supervision of Dr. Dimosthenis Karatzas. His research interests include Self-Supervised Learning, Scene Text Understanding, 3D Reconstruction, Depth Estimation, and Self-Paced/Curriculum Learning.