Salient Points
In this work, we address the ZSL representation problem using three main ideas.
First, we turn to clustering and use the cluster centroids to represent a video, because the centroids regularize the representation.
Second, our representation combines a visual and a semantic representation, since visual and semantic information complement each other.
Third, we use the signal from classification as direct supervision for clustering, by using Reinforcement Learning (RL). Specifically, we use the REINFORCE algorithm to directly update the cluster centroids.
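To make the three ideas concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that the joint representation is a simple concatenation of visual and semantic features, that cluster assignment is a softmax policy over negative squared distances to the centroids, and that the classification signal is available as a scalar reward. The names `joint_feature`, `assignment_probs`, `reinforce_step`, and `reward_fn` are hypothetical.

```python
import numpy as np


def joint_feature(visual, semantic):
    # Assumption: the visual and semantic parts are combined by concatenation.
    return np.concatenate([visual, semantic])


def assignment_probs(x, centroids):
    # Soft assignment policy: p(k | x) is proportional to exp(-||x - c_k||^2).
    sq_dist = np.sum((centroids - x) ** 2, axis=1)
    logits = -sq_dist
    logits = logits - logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()


def reinforce_step(x, centroids, reward_fn, rng, lr=0.01):
    # Sample a cluster for x from the assignment policy.
    p = assignment_probs(x, centroids)
    k = int(rng.choice(len(centroids), p=p))

    # Hypothetical reward: e.g. +1 if the classifier is correct when the
    # video is represented by centroid k, and -1 otherwise.
    reward = reward_fn(k)

    # Gradient of log p(k | x) with respect to every centroid c_j:
    #   2 * (1[j == k] - p_j) * (x - c_j)
    onehot = np.zeros(len(centroids))
    onehot[k] = 1.0
    grad_log_p = 2.0 * (onehot - p)[:, None] * (x - centroids)

    # REINFORCE: gradient ascent on the expected reward.
    return centroids + lr * reward * grad_log_p, k
```

Under these assumptions, a positive reward pulls the sampled centroid toward the sample and a negative reward pushes it away, so the classification outcome shapes the clusters directly rather than through a separate clustering objective.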
The result is a marked improvement over the previous state of the art across datasets and tasks, with up to 11.9% absolute improvement on HMDB51 for GZSL.