Keynote talk by Aishwarya Agrawal

Visual Grounding in Visual Question Answering

DeepMind, London

Abstract

In this talk, I will present our work on a multi-modal AI task called Visual Question Answering (VQA): given an image and a natural language question about the image (e.g., “What kind of store is this?”, “Is it safe to cross the street?”), the machine’s task is to automatically produce an accurate natural language answer (“bakery”, “yes”). Applications of VQA include aiding visually impaired users in understanding their surroundings, aiding analysts in examining large quantities of surveillance data, teaching children through interactive demos, interacting with personal AI assistants, and making visual social media content more accessible. Specifically, I will provide a brief overview of the VQA task, dataset, and baseline models, and will elaborate on the problem of visual grounding in existing VQA models. I will then describe how to address this problem through 1) a new evaluation protocol, 2) a new model architecture, and 3) a novel objective function. Towards the end of the talk, I will discuss the challenges in VQA that we have yet to address, despite the tremendous progress of the last few years.
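To make the task setup concrete, below is a minimal, illustrative sketch of the kind of VQA baseline the talk overviews: the question and the image are encoded separately, the two representations are fused, and a classifier scores a fixed set of candidate answers. This is not the speaker's model; all class names, dimensions, and vocabulary sizes are hypothetical placeholders.

```python
# Minimal VQA baseline sketch (hypothetical; not the speaker's architecture).
import torch
import torch.nn as nn


class SimpleVQABaseline(nn.Module):
    def __init__(self, vocab_size: int, num_answers: int,
                 embed_dim: int = 300, hidden_dim: int = 512,
                 image_feat_dim: int = 2048):
        super().__init__()
        # Question encoder: word embeddings followed by an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Project precomputed image features (e.g., from a CNN) to the same size.
        self.image_proj = nn.Linear(image_feat_dim, hidden_dim)
        # Classifier over a fixed vocabulary of frequent answers.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, image_feats: torch.Tensor,
                question_tokens: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, image_feat_dim); question_tokens: (batch, seq_len)
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q = h_n[-1]                      # final LSTM hidden state
        v = torch.relu(self.image_proj(image_feats))
        fused = q * v                    # simple element-wise fusion
        return self.classifier(fused)    # scores over candidate answers


# Toy usage with dummy image features and token ids.
model = SimpleVQABaseline(vocab_size=10000, num_answers=3000)
logits = model(torch.randn(2, 2048), torch.randint(1, 10000, (2, 8)))
print(logits.shape)  # torch.Size([2, 3000])
```

Treating VQA as classification over frequent answers is a common simplification in baseline models; the talk's focus is on why such models often answer without genuinely grounding their predictions in the image.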

Keynote video: agrawal_gecko20.mkv

Thank you for watching! What's next?

  • Attend the live discussion session for this keynote on May 18th! Check the program and consider registering in advance.
  • Leave comments/questions on our keynote talks in the #keynotequestions channel on geckosympo.slack.com. Note: to access it, you need to register for GeCKo; you will then receive an invitation to Slack.
  • Explore! Watch the other keynote talks and check out the posters.