Talk Schedule, Titles and Abstracts

Tentative Draft Schedule

  • Day 1 - 27th December

  • Reach Mysore by 1 pm. Lunch till 2:30 pm
  • Vision Session 2:30 - 5 pm
    • Talk 1 2:30pm - 3:20 pm Abhinav Gupta
    • Talk 2 3:20pm - 4:10 pm Karteek Alahari
    • Talk 3 4:10pm - 5:00 pm Vineet Gandhi
  • Poster Session with Tea 5 pm - 6:30 pm
  • Talk Session 6:30 - 8 pm
    • Talk 4 6:30 pm - 7:20pm Manohar Paluri
    • Short Talk 7:20 pm - 7:40 pm Venkatesh Babu
    • Short Talk 7:40 pm - 8 pm CV Jawahar
  • Dinner 8 pm onwards
  • Day 2 - 28th December

  • Breakfast 8 am - 9 am
  • Vision and Language Session1 9 am - 10:40 am
    • Talk 5 9:00am - 9:50 am Andrew Zisserman
    • Talk 6 9:50am - 10:40 am Devi Parikh
  • Coffee break 10:40 - 11 am
  • Short talk Session 11 am - 1pm
    Ankush Gupta, Anoop Namboodiri, Kaushik Mitra, Vineeth Balasubramanian
  • Vision and Language Session2 2 pm - 3:40 pm
    • Talk 7 2:00pm - 2:50 pm Dhruv Batra
    • Talk 8 2:50pm - 3:40 pm Santosh Divvala
  • Coffee break 3:40pm - 4 pm
  • AI Session1 4 pm - 5:40 pm
    • Talk 9 4:00pm - 4:50 pm Deepak Pathak
    • Talk 10 4:50pm - 5:40 pm Vinay Namboodiri
  • Day 3 - 29th December

  • Talk Session 9 am - 10:10 am
    • Talk 11 9am - 9:40 am Pulkit Agrawal
    • Short Talk 9:40 am - 10:10 am Soma Biswas
  • Coffee break 10:10am - 10:30 am
  • Talk Session 10:30 am - 12:10 pm
    • 10:30 am - 11:20 am Short Talks Gaurav Sharma, Chetan Arora
    • Talk 12 11:20am - 12:10 pm Subhransu Maji

Talk Abstracts

  • Abhinav Gupta, Beyond Supervised Feedforward ConvNets
    Better models for detection using feedback and top-down connections, and training in an unsupervised manner.

  • Andrew Zisserman, Sequence to Sequence Models for Computer Vision
    Sequence to sequence models have been used successfully in many disparate application areas, such as machine translation, text recognition, and automated speech recognition. Over the last two years these models have also seen applications in computer vision, such as automatic caption generation.

    In this talk I will first introduce the sequence to sequence architecture, and then describe two computer vision applications: automated lip reading, and automated recognition of sign language. The goal of the lip reading project is to recognize phrases and sentences being spoken by a talking face, with or without the audio, in an open-world scenario, with unrestricted vocabulary and in-the-wild videos. The goal of the sign language project is to translate phrases and sentences in British Sign Language (BSL) into English. Both of these require large aligned datasets for training the sequence to sequence model. We describe how such datasets can be built automatically, and the progress achieved so far on the two goals.

    This work is with Joon Son Chung, Oriol Vinyals and Andrew Senior.

  • Deepak Pathak, Unsupervised learning of Visual Representations

  • Devi Parikh, Making the V in VQA Matter: Introducing the Visual Question Answering (VQA) 2.0 Dataset
    In this talk, I will introduce the Visual Question Answering (VQA) 2.0 dataset. 
    Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability.
    We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the current VQA dataset by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has twice the number of image-question pairs. We benchmark a number of state-of-the-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors -- confirming what seems to be a qualitative sense among practitioners.
    I will also talk about several ongoing VQA-related directions in my lab: a new hierarchical co-attention model that not only reasons about image attention (i.e., where to look) but also about question attention (i.e., what to listen to); learning to count everyday objects in everyday scenes as a first step towards answering "how many" questions accurately; making VQA models more human-like by training them to recognize which questions are not relevant for a given image and hence shouldn't be answered at all; and finally, thinking of different questions about images as viewing images from different perspectives and using these perspectives to improve other image understanding tasks such as image-caption ranking.

  • Dhruv Batra, Visual Dialog
    We are witnessing unprecedented advances in computer vision (CV) and artificial intelligence (AI). What lies next for AI? We believe that the next generation of visual intelligence systems will need to possess the ability to hold a meaningful dialog with humans in natural language about visual content for applications like:
    • Aiding visually impaired users in understanding their surroundings or social media content (AI: ‘John just uploaded a picture from his vacation in Hawaii’, Human: ‘Great, is he at the beach?’, AI: ‘No, on a mountain’).
    • Aiding analysts in making decisions based on large quantities of surveillance data (Human: ‘Did anyone enter this room last week?’, AI: ‘Yes, 27 instances logged on camera’, Human: ‘Were any of them carrying a black bag?’).
    • Interacting with an AI assistant (Human: ‘Alexa – can you see the baby in the baby monitor?’, AI: ‘Yes, I can’, Human: ‘Is he sleeping or playing?’).
    • Robotics applications (e.g. search and rescue missions) where the operator may be ‘situationally blind’ and operating via language (Human: ‘Is there smoke in any room around you?’, AI: ‘Yes, in one room’, Human: ‘Go there and look for people’).
    As a step towards conversational visual AI, we introduce the task of Visual Dialog.
    Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from the history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being grounded in vision enough to allow objective evaluation of individual responses and benchmark progress.
    We develop a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). Data collection is underway and on completion, VisDial will contain 1 dialog with 10 question-answer pairs on all ~200k images from COCO, with a total of 2M dialog question-answer pairs.
    I will introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders -- Late Fusion, Hierarchical Recurrent Encoder and Memory Network -- and 2 decoders (generative and discriminative), which outperform a number of sophisticated baselines. Putting it all together, I will demonstrate the first ‘visual chatbot’!

  • Karteek Alahari, Learning Motion Patterns and their use for Semantic Segmentation
    The first part of the talk addresses the task of determining whether an object is in motion, irrespective of camera motion, by learning motion patterns in videos. The core of our approach is a fully convolutional network (FCNN), which is learnt entirely from synthetic video sequences. This encoder-decoder style architecture first learns a coarse representation of the optical flow field features, and then refines it iteratively to produce motion labels at the original resolution. We demonstrate the benefits of this learning framework on the moving object segmentation task.

    The second part presents the use of motion cues for semantic segmentation. FCNNs, the new state of the art for this task, rely on a large number of training images with strong pixel-level annotations. To address this, we present motion-CNN (M-CNN), a novel FCNN framework which incorporates motion cues, and is learned from video-level weak annotations. We demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images.

  • Manohar Paluri, Challenges and opportunities learned building large scale AI systems
    I feel tremendously lucky to be part of an era where progress is happening rapidly. While we are yet to understand how to define when and why a system has intelligence, and how to measure progress in AI, we as a community have come a long way in building systems that perceive and understand content, be it text, conversations, images, videos and so on. The progress has accelerated even more in the past few years with the increase in data, compute and capacity to train bigger models. But taking these capabilities from the research and academic world and making them truly work at scale has always been a challenge. While the previous generation's challenges were more in the realm of making systems work, right now we face the challenge of making these systems work at scale. There is an evident lag between the capabilities we share in the research community and the capabilities that are deployed in the real world. The gap stems from the various challenges we hit at scale. With every challenge there is an equally interesting opportunity. In this talk I want to highlight some of these challenges and opportunities, and especially focus on where the systems break. While I will try my best to span as many systems as possible, I will have a more concrete focus on Computer Vision problems.

  • Pulkit Agrawal, Learning to forecast and control from visual inputs
    The ability to forecast how different objects will be affected by an applied action (i.e. intuitive physics) is likely to be very useful for executing novel manipulation tasks. Similarly, the ability to forecast how humans will act in the future (i.e. intuitive behavior) can enable an agent to plan its interactions with humans. I will present results of some preliminary investigations on building intuitive models of physics and intuitive models of behavior (in the context of water polo games) directly from visual inputs.

  • Santosh Divvala, Visual Knowledge Extraction and Reasoning
    Visual Knowledge is an important piece of the Scene Understanding puzzle. Extracting rich visual knowledge from different modalities (images, videos, text, diagrams, etc.) and leveraging it towards efficient visual reasoning is a core focus of our group at the Allen Institute for Artificial Intelligence (AI2). This talk will cover some of our exciting Visual Knowledge Extraction & Reasoning frameworks.

  • Subhransu Maji, Reasoning about 3D shapes from their projections
    We live in a three-dimensional world, yet all we see are its projections onto a two-dimensional space. In this talk, I'll present some recent work on analyzing 3D shapes from their projections. The first part of the talk is about categorizing and segmenting 3D shapes using view-based techniques. This allows us to use image representations learned on large labelled datasets for 3D shape analysis where training data is limited. The second part of the talk tackles the problem of inferring 3D shapes of objects given a collection of their views. Imagine walking into a showroom containing a variety of chairs, each at a different (unknown) orientation. Can we infer the 3D shape of each chair from a single glimpse? Note that not all chairs are the same. Some might have armrests while others don't, the number of legs can vary from one chair to another, etc., hence standard structure-from-motion (SFM) techniques may fail. Moreover, can we do this without having a prior model of 3D chairs (hence hard for non-rigid SFM), or a model for predicting depth or viewpoint (i.e., completely unsupervised)? We present an approach that learns a generative model over 3D shapes by combining a 3D shape generator with a projection module. This allows us to induce a 3D shape distribution given a collection of views of an arbitrary set of objects, e.g., we can combine views of airplanes, cars, and chairs to infer their shapes. We show that the model implicitly learns to reason about part correspondence and viewpoints without explicit supervision.

  • Vineet Gandhi, Beyond the Ken Burns effect - searching new avenues for storytelling
    The surge of high-resolution cameras has opened interesting avenues for virtual cinematography and editing directly in the video space. In this talk, I will explain how traditional work in human tracking and camera stabilisation can be extended for improved visualizations and content-aware displays from static high-resolution recordings. Specifically, I will talk about our findings along two novel directions: (a) automated multi-clip editing and (b) split-screen visualizations. I will show how these techniques can significantly enhance the viewing experience, especially in the context of recorded staged performances.