Learning From Weakly Labelled Data
I am highly interested in developing computer vision models that learn from weakly labelled datasets. Although a huge amount of image and video data is readily available nowadays, labelling that data requires a tremendous amount of human effort, so most of it is either weakly labelled or not labelled at all. It will be interesting to see whether all of this weakly labelled and unlabelled visual data can be used to train models that are as powerful as models built from a limited amount of fully labelled data. One simple strategy in this direction is sketched below.
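As a minimal illustration (not a claim about any particular method of mine), self-training with pseudo-labels is one common way to fold unlabelled data into training: a model fit on the small labelled set labels the unlabelled examples it is confident about, and those examples are added to the training set. The classifier, thresholds, and the arrays `x_labelled`, `y_labelled`, `x_unlabelled` below are hypothetical placeholders for a real vision pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(x_labelled, y_labelled, x_unlabelled, threshold=0.95, rounds=3):
    """Self-training sketch: repeatedly pseudo-label confident unlabelled
    examples and fold them back into the labelled training set."""
    clf = LogisticRegression(max_iter=1000)
    x_train, y_train = np.asarray(x_labelled), np.asarray(y_labelled)
    pool = np.asarray(x_unlabelled)
    for _ in range(rounds):
        clf.fit(x_train, y_train)
        if len(pool) == 0:
            break
        probs = clf.predict_proba(pool)
        confident = probs.max(axis=1) >= threshold   # keep only high-confidence predictions
        if not confident.any():
            break
        pseudo = clf.classes_[probs[confident].argmax(axis=1)]
        x_train = np.vstack([x_train, pool[confident]])
        y_train = np.concatenate([y_train, pseudo])
        pool = pool[~confident]                      # remaining unlabelled pool
    return clf
```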
Zero-Shot Learning
I am also fascinated by the idea of zero-shot learning, where the training classes and testing classes have zero intersection. Humans can identify a completely new object by comparing it with previously known objects and the contextual relationships those objects have with the unknown one. Such contextual relationships can be extracted from various sources, such as attribute annotations of the classes, textual knowledge bases, or hierarchical representations of the classes. Since abundant textual knowledge is already available, and recent advances in representing text as semantic vectors learned in a completely unsupervised way have shown promising results in several computer vision tasks, I am using textual data to transfer knowledge from known classes in order to identify objects of unknown classes. However, contextual knowledge gathered from text can be confusing and misleading: it may show a strong correspondence between two objects in a context that is utterly irrelevant to computer vision problems. It will be interesting to see whether a model can understand textual context well enough to pick out only the useful, less confusing information.
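A minimal sketch of the common embedding-based formulation of this idea (kept generic rather than tied to any specific published model): a projection learned on seen classes maps image features into the same space as unsupervised word vectors, and an unseen class is picked by nearest neighbour among the class-name vectors. The projection matrix, feature dimensions, and class names below are placeholders.

```python
import numpy as np

def zero_shot_predict(image_feature, projection, class_name_vectors):
    """Project an image feature into the text-embedding space and return the
    unseen class whose name vector is most similar (cosine similarity)."""
    embedded = projection @ image_feature                       # vision space -> text space
    embedded /= np.linalg.norm(embedded) + 1e-8
    names, vectors = zip(*class_name_vectors.items())
    mat = np.stack([v / (np.linalg.norm(v) + 1e-8) for v in vectors])
    scores = mat @ embedded                                      # cosine similarity per class
    return names[int(np.argmax(scores))]

# Hypothetical usage: `projection` would be learned on seen classes; the word
# vectors for unseen class names would come from an unsupervised embedding model.
unseen = {"zebra": np.random.randn(300), "okapi": np.random.randn(300)}
projection = np.random.randn(300, 2048)    # placeholder for a learned mapping
image_feature = np.random.randn(2048)      # placeholder CNN feature
print(zero_shot_predict(image_feature, projection, unseen))
```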
Image/Video to Text Mapping
Recently there has been a lot of interesting work on bi-directional mapping between image/video data and text. The ability to describe a video or image in natural language demonstrates both a model's high-level capability to understand visual data and its proficiency in relating visual cues to natural-language fragments in a semantically meaningful way. This work paves the way for models that better understand complex scenes from both visual data and the corresponding text, and that use this capability to describe or label images or videos containing previously unseen scenes or objects, just as a human would label them. Visual and textual data also often complement each other in conveying a concept fully (for example, reading a book together with its figures). I think it will be interesting to see whether a model can relate visual data to the corresponding text without any human intervention, and thereby learn new concepts more profoundly than it could from visual or textual data alone.
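Once images and sentences are embedded in a shared space (for instance by two encoders trained on paired data with a ranking loss), the bi-directional mapping reduces to nearest-neighbour retrieval in either direction. The sketch below assumes such a shared space already exists; the embedding sizes and random inputs are placeholders.

```python
import numpy as np

def retrieve(query, candidates, top_k=3):
    """Rank candidate embeddings by cosine similarity to a query embedding.
    Works in either direction (image -> captions or caption -> images) once
    both modalities live in the same embedding space."""
    q = query / (np.linalg.norm(query) + 1e-8)
    c = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-8)
    scores = c @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Hypothetical shared space of size 256: retrieve the images best matching a caption.
image_embeddings = np.random.randn(100, 256)
caption_embedding = np.random.randn(256)
indices, scores = retrieve(caption_embedding, image_embeddings)
print(indices, scores)
```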