Salient Points

  • In this work, we address the representation problem in zero-shot learning (ZSL) using three main ideas.

  • First, we turn to clustering and represent each video by the centroids of its clusters, because centroids regularize the representation.

  • Second, our representation combines a visual and a semantic representation, since the two kinds of information complement each other.

  • Third, we use the classification signal as direct supervision for clustering via Reinforcement Learning (RL). Specifically, we use the REINFORCE algorithm to update the cluster centroids directly.

  • The result is a remarkable improvement over all previous state-of-the-art methods across datasets and tasks, with up to 11.9% absolute improvement on HMDB51 for GZSL.
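
As a rough illustration of the third point, the Python sketch below shows one REINFORCE-style update of cluster centroids. It is not the authors' implementation. The softmax assignment policy over negative squared distances, the names `assignment_probs` and `reinforce_step`, the learning rate, and the toy `reward_fn` (a stand-in for the classification signal) are all assumptions made for this example.

```python
import math
import random


def assignment_probs(x, centroids):
    """Softmax over negative squared distances from x to each centroid.

    Assumed policy for this sketch: closer centroids are more likely.
    """
    logits = [-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(x, centroids, reward_fn, rng, lr=0.1):
    """One REINFORCE update of the centroids for a single feature vector x.

    reward_fn(k) is a hypothetical stand-in for the classification signal:
    it returns a scalar reward for having assigned x to centroid k.
    """
    probs = assignment_probs(x, centroids)
    # Sample an action (a centroid index) from the assignment policy.
    k = rng.choices(range(len(centroids)), weights=probs)[0]
    reward = reward_fn(k)
    updated = []
    for j, c in enumerate(centroids):
        # Score function for this policy:
        # d log pi(k|x) / d c_j = (1[j == k] - pi_j) * 2 * (x - c_j)
        coeff = (1.0 if j == k else 0.0) - probs[j]
        updated.append([ci + lr * reward * coeff * 2.0 * (xi - ci)
                        for xi, ci in zip(x, c)])
    return updated, k, reward


# Toy usage: reward +1 if the sampled centroid is index 1, else -1.
rng = random.Random(0)
centroids = [[0.0, 0.0], [1.0, 1.0]]
x = [0.9, 0.8]
new_centroids, k, reward = reinforce_step(
    x, centroids, lambda idx: 1.0 if idx == 1 else -1.0, rng
)
```

In this toy setup, whichever centroid is sampled, the rewarded centroid moves toward `x` and the other moves away, so the probability of the rewarded assignment increases after the step. A practical system would typically also subtract a baseline from the reward to reduce variance, which is omitted here for brevity.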