Zero-Shot Learning combines semantic information with visual features to make predictions on classes that are not present in the training data [8, 7]. Semantic information is commonly given by textual descriptions, word embeddings, or manually annotated attributes. A widely investigated idea for enabling Zero-Shot Learning is to learn a compatibility function between visual features and semantic attributes. Existing methods learn a projection from the visual to the semantic space [5], from the semantic to the visual space [9], or project both modalities into a common space [1]. One challenge these models face is a bias towards the seen classes, which causes them to frequently misclassify unseen classes as seen ones. Generative models address this problem by synthesizing features for the unseen classes, learning the synthesis from the visual features and semantic information of the seen classes [6].
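As a rough illustration of such a compatibility function, the sketch below projects visual features into the semantic space and scores each class by cosine similarity with its attribute or word-embedding vector. It is a minimal example assuming PyTorch; the feature dimensions (2048-d visual, 300-d semantic) and class counts are illustrative placeholders, not details of any cited method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualToSemantic(nn.Module):
    """Minimal visual-to-semantic projection for zero-shot classification.

    Visual features are mapped into the semantic (attribute) space and
    compared against class embedding vectors; unseen classes can be scored
    at test time simply by adding their semantic vectors.
    """

    def __init__(self, visual_dim=2048, semantic_dim=300):
        super().__init__()
        self.project = nn.Linear(visual_dim, semantic_dim)

    def forward(self, visual_feats, class_embeddings):
        # visual_feats: (batch, visual_dim); class_embeddings: (num_classes, semantic_dim)
        projected = F.normalize(self.project(visual_feats), dim=-1)
        embeddings = F.normalize(class_embeddings, dim=-1)
        # Compatibility score = cosine similarity between the projected
        # feature and each class's semantic embedding.
        return projected @ embeddings.t()

# Example usage with random tensors standing in for real features/attributes.
model = VisualToSemantic()
feats = torch.randn(4, 2048)      # visual features from a backbone
attrs = torch.randn(10, 300)      # semantic vectors for 10 (seen + unseen) classes
scores = model(feats, attrs)      # (4, 10) compatibility scores
preds = scores.argmax(dim=-1)
```

The same skeleton covers the other projection directions: mapping semantic vectors into the visual space, or mapping both sides into a shared embedding space, only changes which inputs pass through learned projections.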
Zero-Shot Detection is a more challenging problem, which additionally requires predicting bounding boxes for unseen objects in the image [4]. A previous method projects visual features into the semantic space and then uses nearest-neighbor search in that space to detect novel objects [4]. Because unseen classes suffer from bias towards the seen classes, the detector may miss unseen objects or mistake background regions for them [6]. Generative methods for Zero-Shot Detection usually synthesize visual features for unseen classes to tackle this problem [6, 10].
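The nearest-neighbor step can be sketched as follows. This is a simplified illustration assuming PyTorch; the `projector`, the class embeddings, and the background threshold `score_thresh` are hypothetical placeholders rather than details of the cited detectors.

```python
import torch
import torch.nn.functional as F

def label_regions_by_nearest_class(region_feats, projector, class_embeddings,
                                   score_thresh=0.5):
    """Assign zero-shot labels to detected region features.

    region_feats:     (num_regions, visual_dim) features from a detector's ROI head.
    projector:        module mapping visual features into the semantic space.
    class_embeddings: (num_classes, semantic_dim) semantic vectors, including
                      unseen classes.
    Regions whose best similarity falls below score_thresh are treated as background.
    """
    proj = F.normalize(projector(region_feats), dim=-1)
    emb = F.normalize(class_embeddings, dim=-1)
    sims = proj @ emb.t()                      # cosine similarity to every class
    best_sim, best_cls = sims.max(dim=-1)      # nearest class in semantic space
    labels = torch.where(best_sim >= score_thresh,
                         best_cls,
                         torch.full_like(best_cls, -1))   # -1 marks background
    return labels, best_sim

# Example with a hypothetical linear projector and random class embeddings.
projector = torch.nn.Linear(1024, 300)
regions = torch.randn(50, 1024)
embeddings = torch.randn(15, 300)
labels, sims = label_regions_by_nearest_class(regions, projector, embeddings)
```

The fixed threshold makes the seen-class bias visible: unseen objects whose projected features fall closer to seen-class embeddings or below the threshold are dropped, which is the failure mode the generative feature-synthesis methods aim to reduce.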
In addition to static images, video input provides temporal context for object detection in the scene. Methods based on flow tracking, correlation between adjacent time steps, 3D convolutions, or recurrent networks have been explored to take advantage of this contextual information. However, these approaches either rely on a dense, regular sampling of time steps or have a relatively small temporal receptive field. Attention-based architectures can aggregate long-term contextual information at the object level and improve detection results across time [2].
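A minimal sketch of such object-level temporal attention is given below, assuming PyTorch's `nn.MultiheadAttention`: proposals from the current frame attend over a memory of proposals gathered from other frames, so the temporal receptive field is not tied to adjacent time steps. The feature dimension, head count, and proposal counts are illustrative and do not correspond to the architecture of [2].

```python
import torch
import torch.nn as nn

class TemporalObjectAttention(nn.Module):
    """Aggregate object-level features across frames with attention."""

    def __init__(self, feat_dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, current_objs, memory_objs):
        # current_objs: (batch, num_current, feat_dim) proposals in the current frame
        # memory_objs:  (batch, num_memory, feat_dim) proposals pooled from other frames
        attended, _ = self.attn(current_objs, memory_objs, memory_objs)
        # Residual connection keeps per-frame features and adds temporal context.
        return self.norm(current_objs + attended)

# Example usage with hypothetical sizes.
module = TemporalObjectAttention()
cur = torch.randn(2, 100, 256)    # 100 proposals in the current frame
mem = torch.randn(2, 800, 256)    # proposals gathered from 8 support frames
enhanced = module(cur, mem)       # context-enhanced features for the current frame
```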